Suda On Line (SOL) is a web-based project providing facilities to coordinate the effort of researchers in translating the Suda into English and to make the resulting electronic text searchable. The Suda is a 10th century Byzantine Greek historical encyclopedia of the ancient Mediterranean world in five volumes.
The paper discusses the various facilities provided in SOL and some related technical issues. These facilities include registering participants, allocating Suda encyclopedia entries to translators, storing translations, handling subsequent changes to a translation (whether modifications by its initial translator or vetting by editors), and searching translated entries. In addition, the SOL software allows actions such as altering participant profiles, viewing the current roster of participants, and showing the activity log. Technical issues include the implementation of data stores, searching techniques, versioning of data, locking mechanisms, and security concerns.
The Suda is a 10th century Byzantine Greek historical encyclopedia of the ancient Mediterranean world, derived from the scholia to critical editions of canonical works and from compilations by yet earlier authors. It is considered an important text in Greek studies.
The original edition of the Suda was edited by Ada Adler and published in five volumes over the years 1928-1938 [Adl38]. It is organized alphabetically, with entries numbered by the pair (first letter, sequence number), for example (beta, 13) and (xi, 32).
Researchers at the Thesaurus Linguae Graecae (TLG) created an electronic version of the Suda in 1982 based on the original text. The Greek text is encoded in Betacode [Tlg], an ASCII markup that contains alphabetic notations along with codes for switching between Greek and Roman fonts, punctuation, and formatting.
The Suda has never been translated into English. The goal of this project, called Suda On Line (SOL), is to support this translation effort and to make the resulting text searchable by topic, Adler number, and words.
SOL is a collaborative internet-based project. All facilities are provided through web-based interfaces supported by cgi-bin scripts I wrote in Perl. This approach allows us to center the work at one site. Ultimately, SOL aims to produce SGML-tagged databases and to provide links to other electronic resources such as the Perseus Project at Tufts University and the Thesaurus Linguae Graecae (TLG) at UC Irvine.
SOL has changed since this report was written. The original web site for SOL was http://www.cs.engr.uky.edu/~heng00/SUDA. Mukund Chandak, Ross Scaife, and Raphael Finkel have been improving and enhancing SOL 1. This document mainly describes the original version I implemented. Whenever the newer version is mentioned, I use the term "latest". The latest URL for SOL is http://www.stoa.org/sol.
Participants of SOL are classified into different levels of engagement called the "participation level". There are currently four participation levels: Managing editor, Editor, Translator, and Guest. Managing editors are the most privileged group, followed by editors, translators, and finally guests. Members of each level have access to facilities appropriate to that level and all the levels below it.
The following describes the SOL facilities available to each level.
The paper discusses the various facilities in order. It shows how to use these facilities, how I implemented them, and the reasons behind some of the implementations. The paper also examines the registration and login process as the entry point to SOL. Finally, I discuss the current limitations and some future enhancements.
To get involved in the SUDA project, participants must first register themselves. The web site for registration is http://www.cs.engr.uky.edu/~heng00/SUDA/register.html 2. The script register.pl supports the operations needed here.
A participant who wants to register needs to provide information including last name, first name, email address, desired login identification, password, favorite output format of Greek text, and a desired participation level. Currently, output format can be one of the following: "Betacode" for raw Betacode, "GreekXlit" for Latin transliteration, "Sgreek" for Microsoft Windows, "SMK GreekKeys" for Apple Macintosh, "Ismini" for X Windows, and "Unicode" for the Unicode standard.
The script register.pl prevents a new participant from using an existing login identification and email address by scanning through the participant database as well as the pending-registration database. The script also checks for the validity of the email address. It first checks for the existence of the "@" sign; then it validates the mail server given in the email address using nslookup.
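The two-step email check described above can be sketched as follows. This is an illustration in Python, not the original Perl; the function names are hypothetical, and `default_resolver` is an assumed stand-in for the nslookup call, whose exact invocation is not shown here.

```python
import socket
from typing import Callable


def default_resolver(host: str) -> bool:
    """Return True if the host name resolves (assumed stand-in for nslookup)."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def email_looks_valid(address: str,
                      resolves: Callable[[str], bool] = default_resolver) -> bool:
    """Apply the two checks in order: an "@" sign, then a resolvable mail host."""
    # Step 1: the address must contain "@" with text on both sides of it.
    local, sep, host = address.rpartition("@")
    if not sep or not local or not host:
        return False
    # Step 2: the host named after the "@" must be known to the resolver.
    return resolves(host)
```

Passing the resolver as a parameter keeps the syntax check testable without network access. A check of this kind only shows that the host exists, not that the mailbox does.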
If a participant chooses to be a "Guest", the registration is automatically approved upon submission. The participant can log in immediately and receives a welcome email from SOL. The community of managing editors receives an email announcing this newly approved registration.
In the case where the selected participation level is not "Guest", when a registration form is submitted, the information is recorded in the pending-registration database. At the same time, the participant receives a notification email, which includes the original registration information, indicating that the registration is being reviewed. In addition, the script sends a new registration-request message to all the managing editors. At this point, any managing editor can approve or disapprove the registration through a web form.
The web site for SOL login is http://www.cs.engr.uky.edu/~heng00/SUDA/login.html 3. This site includes a form that invokes login.pl.
Only participants who want to register with participation level higher than "Guest" need to wait for approval. Guests are automatically approved. A participant can log in after the registration is approved. The script login.pl validates and authenticates the participant's login by comparing the (login identification, password) pair with those stored in the participant database.
If login fails due to incorrect password entry, login.pl presents a form that contains an option requesting original registration information to be re-sent to the email address recorded.
If login is successful, the participant sees a list of tasks appropriate to the approved level of the participant. Often, the first script invokes other scripts. For example, translate_entry.pl invokes translate_main.pl and some other utility scripts to handle the task "Translate entries".
After a successful login, the login script generates a key "tied" to this participant. During the participant's SOL session, the script passes this key as a hidden web-form value from one form to another, along with other hidden values including the participant's login id. This key is used as a replacement for the participant's password. It is a 32-byte character string generated using md5 [Riv92]. The information fed into md5 includes the participant's login id, some secret information (which makes it impossible for an outsider to generate valid keys), and the month of the year. This way, the key expires every month. The side effect of this approach is that if a participant is working from, say, 11:00 pm on May 31 to 2:00 am on June 1, the key expires at 12:00 am on June 1 and a re-login is required. See "Security" for more about security issues.
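A minimal sketch of such a monthly key, written in Python rather than the original Perl. The secret value, the `|` separator, and the function names are assumptions made for illustration; the text above specifies only the three ingredients fed to md5.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical site secret; the real value and how it is stored are not shown.
SECRET = "replace-with-site-secret"


def session_key(login_id: str, today: date, secret: str = SECRET) -> str:
    """Hash login id, secret, and month into a 32-character hex string."""
    material = f"{login_id}|{secret}|{today.month}"
    return hashlib.md5(material.encode("utf-8")).hexdigest()


def key_is_valid(login_id: str, key: str, today: date,
                 secret: str = SECRET) -> bool:
    """Recompute the key for the current month and compare it with the form value."""
    return hmac.compare_digest(key, session_key(login_id, today, secret))


may_key = session_key("translator1", date(2000, 5, 31))
still_valid_in_may = key_is_valid("translator1", may_key, date(2000, 5, 1))
valid_in_june = key_is_valid("translator1", may_key, date(2000, 6, 1))
```

Because only the month enters the hash, a key issued late on May 31 stops validating at the start of June, which is the side effect described above.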
Managing editors are the most trusted group in SOL. They oversee the well-being of SOL. Any decision made by a managing editor can potentially affect the entire project and hence is best done by consensus. The community of managing editors ought to be small and they should work together closely. Here is what a managing editor can do:
This feature allows a managing editor to become any participant of a lower level of participation. A managing editor is not allowed to become another managing editor. This feature allows a managing editor to carry out tasks on behalf of other participants in order to help, especially for those participants who are new to SOL.
A managing editor chooses a target participant from the web form presented and clicks on "proceed" to assume a new identity. The form is generated based on data in the participant database. The script disguise.pl takes care of this task. Once the request is submitted, a list of tasks appropriate to the newly assumed participant is presented as if the participant had just logged in. Managing editors may log in again to resume their identity; clicking the back button in the web form also returns to their own identity.
This feature allows managing editors to send email to participants of different levels of engagement.
In the web form, a managing editor may single-select or multiple-select the target groups presented, fill in the subject and body of the email, and submit. The script sendemail.pl supports this operation.
We keep two versions of the translation database, the test version and the production version. The test version is for experimental purposes. Participants can manipulate the test database and familiarize themselves with SOL without committing any real work. This version is cleared out periodically. The production version is meant to be final and permanent.
This feature allows the managing editors to erase the test database when necessary. The form invokes clear_testdb.pl. This process is reflected in the activity log as well.
When a managing editor chooses this task, an assignment form is presented. The script approve_reg.pl handles functions needed here.
Each approved registration is kept as a separate record in the participant database, and the pending registrations are kept in the same way in the pending-registration database. When a managing editor decides to approve a pending registration, the related record is removed from the pending-registration database and added to the participant database. All disapproved registrations are removed permanently from the pending-registration database.
The script also logs this process in the activity log.
The script assign_entry.pl handles the functions needed here.
The Suda is organized alphabetically, with entries numbered by the pair (first letter, sequence number). In SOL, each of these pairs is referred to as an "Adler number".
The assignment form contains the following: (a) a list of all the participants whose level of engagement is above "Guest", (b) a list of all first letters, and (c) two boxes to hold the starting number and the ending number of a numeric range. Together (b) and (c) form an "Adler-number range". An example would be (phi:1-5). A managing editor records assignments by filling out and submitting this form. Assignment of a single entry can be achieved by using the same numbers in the Adler-number range, for instance, (delta:1170-1170).
We say an entry exists if its Adler number is properly defined. An entry is said to be available if it has not already been assigned. When a managing editor submits an assignment request, the script assign_entry.pl checks for the existence and the availability of the entries specified by the Adler-number range. It checks existence by comparing the Adler-number range with a predefined Adler-number limit stored in the file greek_letter.limit, and availability by checking the corresponding bits in a related status-bitmap file. There is one status-bitmap file for each Greek letter. The name of the file holding the status bitmap is greek_letter_status.bitmap, where greek_letter is the name of any Greek first letter, such as "alpha" or "beta". For example, the alpha_status.bitmap file shows which entries starting with "alpha" have already been assigned and which have not. These status-bitmap files actually tell more than just that: they also indicate whether an individual entry has been assigned, translated, modified, or vetted.
After verifying the existence and availability of the selected entries, the script sets the corresponding bits in the related status-bitmap file to indicate the range is now assigned and inserts a record in the assignment database to record who is assigned what entries. The target participant is said to "own" those entries once they are assigned. Finally, the script logs the activity in the activity log.
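The existence and availability checks can be sketched as follows, in Python rather than the original Perl. The sketch keeps the status bitmap in memory as one byte per entry; the status codes `UNASSIGNED` and `ASGN` and the function name are hypothetical, since the actual byte values used in the files are not given here.

```python
# Hypothetical one-byte status codes; the real file encoding is not shown here.
UNASSIGNED = ord("0")
ASGN = ord("A")


def assign_range(bitmap: bytearray, limit: int, start: int, end: int) -> bool:
    """Mark entries start..end as assigned if they all exist and are available."""
    # Existence: the range must lie within 1..limit (the per-letter limit).
    if start < 1 or end < start or end > min(limit, len(bitmap)):
        return False
    # Availability: entry n is stored in byte n - 1 (the fourth byte is entry 4).
    if any(bitmap[n - 1] != UNASSIGNED for n in range(start, end + 1)):
        return False
    for n in range(start, end + 1):
        bitmap[n - 1] = ASGN
    return True


phi = bytearray(b"0" * 10)            # ten unassigned entries under one letter
first = assign_range(phi, 10, 1, 5)   # (phi:1-5) succeeds
second = assign_range(phi, 10, 5, 6)  # overlaps entry 5, so it is refused
```

Checking the whole range before changing any byte means a refused request leaves the bitmap untouched.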
This form invokes the script reassign_entry.pl.
This feature allows previously assigned entries to be reassigned to another participant. The process of reallocating entries is similar to allocating new entries except for additional work required to reverse the previous changes.
Entries that are not previously assigned cannot be reassigned. No action is taken if the target owner of the reallocation is the same as the current owner.
Reassignment can potentially affect more than one existing participant since the given Adler-number range may include entries assigned to more than one participant. This task usually requires the following changes. First, the script updates the new ownership status of affected participants in the assignment database. In addition, the script resets the status of any previously translated entries in the corresponding status-bitmap files. Entries can be in one of the following statuses: newly assigned ("ASGN"), just translated ("TRANS"), modified by its translator ("MOD"), or vetted by editors ("VET"). The current status of an entry is based on the last action taken on that entry. All reassigned entries are marked as newly assigned ("ASGN") regardless of their current statuses. This implementation may sound dangerous for entries already translated or vetted but it is by design. We assume that managing editors always make the right decisions. The script also updates the completion database by removing any affected entries.
The script displays diagnostic messages of the result of the reassignment including participants affected before final commitment is made. However, managing editors should still proceed with care. The script sends email to all affected participants notifying them of their new ownership status. Finally, the script logs the activity in the activity log.
The scripts edit_entry.pl and edit_main.pl support the operations seen here.
After Suda entries are translated, editors can examine them. We call this process of examination "vetting". Vetting usually involves checking grammar, format, style and content.
The status of an entry can change. Editors can choose to review only entries with a particular status, or all translated entries. Based on the choice made, the script edit_entry.pl presents a list of available entries. Editors then proceed by choosing a single entry to work with.
There is no limit on the number of times a single translation can be vetted or modified. Because we keep all versions of modification and vetting, editors can choose to see the entire update history of a particular entry. By default, however, only the latest version of modification or vetting is shown. In the case of a newly translated entry, the translation itself is the latest version.
The script edit_main.pl imposes several checks on the reviews submitted. First, the entry cannot have an empty translation body or an empty list of keywords. This precaution avoids accidental deletion by editors.
Since SOL is web-based, it is possible that multiple participants update an entry simultaneously. For instance, an owner may modify an entry while an editor is reviewing it. Another circumstance would be two editors reviewing the same entry and trying to submit their changes. Therefore, we need to introduce some consistency control. Each time an entry is stored, a timestamp is attached. We pass this timestamp as a hidden value in the web form. When an entry is subsequently updated, we compare the latest timestamp of this entry with the timestamp in the web form. If there is a match, we can be sure (with a proper file locking mechanism) that no other updates have been made, so we are working on the latest version of the entry. Hence, we can safely update the entry. Otherwise, we know someone else has submitted an update to the entry, meaning the version in hand is stale and should therefore be discarded. If an update is discarded, the script displays messages indicating the update failure and offers the editor an option to review the updated version of the entry.
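The timestamp comparison can be sketched as follows, in Python rather than the original Perl. The in-memory `store` dictionary and the names `submit_update` and `StaleUpdate` are illustrative assumptions. A real implementation would also hold the file lock around the compare-and-write step, as noted above.

```python
class StaleUpdate(Exception):
    """Raised when the form was based on an outdated version of the entry."""


def submit_update(store, adler, form_timestamp, new_text, now):
    """Accept the update only if the form carries the entry's latest timestamp."""
    current = store[adler]
    if current["timestamp"] != form_timestamp:
        # Someone else stored a newer version after this form was generated.
        raise StaleUpdate(adler)
    store[adler] = {"timestamp": now, "text": new_text}


store = {"delta:1170": {"timestamp": 100, "text": "first draft"}}

# The owner and an editor both open the entry and see timestamp 100.
seen_by_owner = store["delta:1170"]["timestamp"]
seen_by_editor = store["delta:1170"]["timestamp"]

# The editor submits first, so the owner's later submission is stale.
submit_update(store, "delta:1170", seen_by_editor, "vetted text", now=101)
try:
    submit_update(store, "delta:1170", seen_by_owner, "owner's change", now=102)
    owner_accepted = True
except StaleUpdate:
    owner_accepted = False
```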
All vettings that are successfully submitted are logged in the activity log.
The script request_entry.pl supports allocation operations.
Translators can request entries they would like to translate by filling out this web form. The form primarily consists of a text box. A translator may specify desired entries in the text box and click "submit". When the form is submitted, the script generates email messages to all managing editors informing them about the request. A managing editor then assigns the appropriate entries to the translator through a web form. Whether translators are assigned what they requested is at the discretion of the managing editors.
When a participant chooses this task, an entry-selection page is presented. The scripts translate_entry.pl and translate_main.pl handle the functions needed here.
On the entry-selection page, the participants are notified of what entries are assigned to them and which of those assigned entries have already been translated. Based on this information, a participant selects an entry and submits the request.
The script translate_entry.pl first verifies that the selected entry is indeed assigned to the participant. It also ensures that the selected entry has not been translated. It then scans through the Suda entry header file (sudahead.list) containing all the valid (Betacode, Adler number) pairs and retrieves the matching Betacode. With the help of the code-conversion utilities, the script converts the Betacode into the desired output format specified by the participant during registration. It then displays a translation form containing the headword and the text of the Suda entry in that format. The form also has radio buttons for choosing the overall type of entry, called the "word type" (which can be either "person", "place" or "other"), and text boxes for holding the translation, notes, bibliographic references, related internet addresses, and keywords. The script then invokes translate_main.pl.
Since the electronic Suda text is large (about 7 MB), I have created an index file to speed up lookups. The index file stores an (Adler number, starting byte, ending byte) triplet for every entry. When a script needs the text associated with an Adler number, an underlying utility script, Util_bin_search.pl, first performs a binary search on the index file for the matching entry and then uses the location information found to retrieve the requested text.
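A minimal sketch of the index lookup, in Python rather than the original Perl. It assumes the index is sorted by key, that the ending byte is exclusive, and that the index fits in memory. None of these details are stated above, and the function names and sample data are hypothetical.

```python
from bisect import bisect_left


def lookup(index, keys, adler):
    """Binary-search the sorted keys; return (start, end) or None if absent."""
    i = bisect_left(keys, adler)
    if i < len(keys) and keys[i] == adler:
        _, start, end = index[i]
        return start, end
    return None


def fetch(blob, index, keys, adler):
    """Use the byte offsets from the index to slice the entry out of the text."""
    span = lookup(index, keys, adler)
    if span is None:
        return None
    start, end = span
    return blob[start:end]  # assumes the ending byte is exclusive


# Toy data standing in for the Suda text and its index.
blob = b"AAAABBBCC"
index = [(("alpha", 1), 0, 4), (("beta", 13), 4, 7), (("xi", 32), 7, 9)]
keys = [key for key, _, _ in index]

found = fetch(blob, index, keys, ("beta", 13))
missing = fetch(blob, index, keys, ("beta", 14))
```

With the offsets in hand, only the requested span of the large text file needs to be read, rather than the whole file.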
A participant fills in the translation form and proceeds to the preview and confirmation page.
If the submission is successful, the software adds the translation to the translation database, updates the completion database, and sets the matching bit in the related status-bitmap file to "TRANS" indicating the entry is now translated.
The completion database records which entries their owners have translated. One interesting question is why we keep the completion database at all, since we could derive the same information by fetching the assigned entries from the assignment database and checking them against either the translation database or each related status-bitmap file, one at a time. The obvious answer is speed. With the help of the completion database, we eliminate these expensive steps (see Figure: Relationships among the assignment database, status-bitmap file and completion database). The completion database allows us to search by translator login id (tid) and get a list of all entries the translator has translated. The caveat is that the completion database can get out of sync with the translations in the case of script failures 4.
All new translations are logged in the activity log.
The scripts modtrans_entry.pl and modtrans_main.pl support this feature.
This feature allows a translator to modify an entry after the initial translation.
Modification is always based on the latest version of the translation. There are three scenarios to determine the latest version of a translation: a) if no vetting or modification is made to the entry, the original translation is the latest version; b) if the last action taken on the entry was vetting, that vetted version becomes the latest version; c) if modification was the last action taken on the entry, that modified version is the latest version.
The first form lists all the entries translated by a translator and prompts for entry selection. Once an entry is selected, the script modtrans_entry.pl presents a form containing the latest version of the translation.
The form not only shows the current version of the translation, but it also fills in the corresponding text boxes so the translator can easily modify the existing text rather than retype everything from scratch.
Just as in the vetting process, the version a translator is working on may be stale at the time of submission due to changes submitted by editors. In this case, the software notifies the translator and provides an option to modify the updated version.
If an entry has been vetted, subsequent modifications by its owner void the vetted status of the entry. The script displays a warning message when this situation happens.
The script logs all the successful entry modifications in the activity log.
Guests may search (handled by search.pl), view all entries under a single Greek letter (handled by list_by_grletr.pl), view all entries under a translator or editor (handled by list_by_tr_ed.pl), or view all entries (handled by listall.pl).
So far, the search function is still in its infancy. Boolean search is not supported. The search string is treated as merely a single pattern. However, the script search.pl supports searching based on a specific field such as Adler number, translated headword, translation, notes, keywords, bibliography, associated internet addresses, and word type.
Any time an entry is viewed, only the latest version is shown. The latest version of the entry could be the result of vetting or subsequent modifications made by its owner if it is not newly translated. The translation database is paragraph-delimited; each paragraph contains all information about that entry. The fact that we keep all the versions of entries in the translation database makes searching more tedious.
My implementation first uses Unix agrep to retrieve all paragraphs containing the search string and redirects them to a temporary file. The next step uses Perl pattern matching to narrow the search.
If the search is based on a specific field, we augment the search string with the tag used in the database. For example, if a participant searches for a keyword "law", the search string becomes "<keyword.*?law.*?</keyword". Here the ".*?" sequences are added to relax the search condition. All entries that match this pattern fulfill the search. However, as noted above, the match can occur in an older version of the entry and therefore yield a result that is no longer valid. Retrieving the latest version involves some processing: we use the timestamp associated with each version as the indicator. After we get the latest version of this entry, we do the same matching test again. This time, if there is a match, we store it in a temporary Perl hash with the entry's Adler number as the key. If there is more than one match, we append the new entry to this hash.
To delimit paragraphs, different tools use different syntax. For Unix agrep, the delimiter needs to be "\n\n". For Perl patterns, the $INPUT_RECORD_SEPARATOR needs to be set to "\n\n" or "" to read in the whole paragraph at a time.
When it is time to output, we reformat and highlight the original search string in the results. The entries are presented in the order of ascending Adler number. We accomplish this part by sorting all the available hash keys and subsequently retrieve the hashed entries in the sorted key order.
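The search steps above (coarse paragraph match, re-test against the latest version, hash keyed by Adler number, sorted output) can be sketched as follows, in Python rather than the original Perl and agrep. The `<adler>` and `<ts>` tags, the one-version-per-line layout, and the function names are assumptions made for illustration. Sorting letters by name is also a simplification.

```python
import re


def adler_sort_key(adler):
    """Split 'phi:12' into ('phi', 12) so numbers sort numerically."""
    letter, _, number = adler.partition(":")
    return (letter, int(number))


def latest_version(paragraph):
    """Return the sub-entry body carrying the largest timestamp."""
    bodies = re.findall(
        r"<(?:trans_en|vet_en)>(.*?)</(?:trans_en|vet_en)>", paragraph)
    return max(bodies,
               key=lambda b: int(re.search(r"<ts>(\d+)</ts>", b).group(1)))


def search_field(database, field, term):
    """Return (Adler number, latest version) pairs whose latest version matches."""
    pattern = re.compile(rf"<{field}.*?{re.escape(term)}.*?</{field}")
    hits = {}
    for paragraph in database.split("\n\n"):
        if not pattern.search(paragraph):   # coarse pass over all versions
            continue
        latest = latest_version(paragraph)
        if pattern.search(latest):          # discard matches in stale versions
            adler = re.search(r"<adler>(.*?)</adler>", paragraph).group(1)
            hits[adler] = latest
    return [(k, hits[k]) for k in sorted(hits, key=adler_sort_key)]


database = "\n\n".join([
    "<adler>phi:12</adler>\n<trans_en><ts>1</ts><keyword>law</keyword></trans_en>",
    "<adler>phi:4</adler>\n<trans_en><ts>2</ts><keyword>Roman law</keyword></trans_en>",
    "<adler>phi:10</adler>\n<trans_en><ts>3</ts><keyword>law</keyword></trans_en>\n"
    "<vet_en><ts>4</ts><keyword>custom</keyword></vet_en>",
    "<adler>phi:2</adler>\n<trans_en><ts>5</ts><keyword>poetry</keyword></trans_en>",
])
matches = [adler for adler, _ in search_field(database, "keyword", "law")]
```

In this toy data, phi:10 matches only in a superseded version, so the second test drops it. Sorting on the numeric part places phi:4 before phi:12.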
Since some searches potentially return many entries, we have introduced paging. By default, twenty entries are shown in each page. Guests can change that setting from a drop-down box in the web form.
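Paging over a sorted result list can be sketched as follows, in Python rather than the original Perl. The function names are hypothetical; the default of twenty entries per page comes from the text above.

```python
def paginate(results, page_number, page_size=20):
    """Return the slice of results shown on a 1-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    start = (page_number - 1) * page_size
    return results[start:start + page_size]


def page_count(total, page_size=20):
    """Number of pages needed for `total` results (rounding up)."""
    return (total + page_size - 1) // page_size


entries = list(range(1, 46))        # 45 results standing in for Suda entries
last_page = paginate(entries, 3)    # the third page holds the remaining five
```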
We use a similar approach to process the results of other view options. When a participant chooses to view all entries under a translator or an editor (both handled by list_by_tr_ed.pl), a name has to be entered as the search criterion. The script first finds all the possible matches in the participant database by scanning through every last name and first name, and retrieves the corresponding login ids. It then invokes search.pl with the search pattern set to either <tid>login_id</tid> in the case of a translator or <eid>login_id</eid> in the case of an editor. The script search.pl does not reveal these login ids in the returned results; it replaces them with the participants' full names. Finally, viewing all entries (handled by listall.pl) skips the pattern-matching stage; the script sorts, reformats, pages and outputs the results.
This form invokes the script view_activity_log.pl.
The log tells the activity history of SOL. All major actions are logged. These actions include approval of pending registrations, assignment and reassignment of entries, clearing of the test database, as well as entry translations, modifications and vettings.
By default, only activities that took place on the current day are shown. Participants can, however, choose to see activities from the last 7 days or the last 30 days.
The script user_profile.pl supports this feature.
Participants are allowed to change all the information about themselves except login id, because login ids are used as keys in various databases in SOL.
This form invokes the script roster.pl.
When a guest selects this task, the script lists the current roster of participants by group. Each participant's information is taken from the stored profile in the participant database. Sensitive information, such as participants' login identifiers, is not shown.
The SOL software is written as a collection of Perl [WCS96] CGI scripts that support web forms. These scripts generate most of the forms used in SOL.
All databases are implemented as flat files in ASCII. This approach makes the data machine-independent and allows for field values with arbitrary length. In most of the files, records are either paragraph or new-line delimited; individual fields are delimited by some uncommonly used strings such as "||" and "&&", although such use might conflict with unexpected data input.
The participant database (user.dat) holds all the records of approved registered participants. The information stored in the database includes the participant's login id, password, name, email address, phone number, affiliated organization, preferred Greek display, and participant level. Login ids are used extensively in other databases. But when revealing participants' login ids is not desirable, as in the case of a search result, this database allows us to convert ids to names.
The pending-registration database (pend_reg.dat) records all the pending registrations submitted by new participants from the SOL registration page. This database contains all the registration information as well as the current status of the registration. All records have an initial status of "NEW". If a participant is approved, the record is removed from this database and added to the participant database. A record is marked as "INPROGRESS" if a managing editor decides that further information is needed for the approval and the registration is being examined. Records of disapproved participants are discarded.
The assignment database (trans_asgn.dat) carries information about Suda entry assignment. In other words, the database tells "who is assigned what". It has a record for each translator listing the Suda entries that translator has been assigned. Each record consists of a translator id and an entry list. An entry list is a "||"-delimited list of "Adler-number range" (this term is explained earlier), for instance, "delta:1170-1170||epsilon:3737-3737".
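The entry-list format can be parsed and regenerated with a few lines, sketched here in Python rather than the original Perl. The function names are hypothetical, and the sketch covers only the "||"-delimited entry list. How the translator id is separated from the list within a record is not specified above, so it is left out.

```python
def parse_entry_list(entry_list):
    """Turn 'delta:1170-1170||epsilon:3737-3737' into (letter, start, end) tuples."""
    ranges = []
    for item in entry_list.split("||"):
        letter, _, span = item.partition(":")
        start, _, end = span.partition("-")
        ranges.append((letter, int(start), int(end)))
    return ranges


def format_entry_list(ranges):
    """Inverse of parse_entry_list."""
    return "||".join(f"{letter}:{start}-{end}" for letter, start, end in ranges)


def is_assigned(ranges, letter, number):
    """True if the Adler number (letter, number) falls inside any range."""
    return any(l == letter and start <= number <= end
               for l, start, end in ranges)


text = "delta:1170-1170||epsilon:3737-3737"
ranges = parse_entry_list(text)
```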
The completion database (trans_done.dat) is parallel to the assignment database. The difference is that the former has an entry for each translator indicating what has been translated, whereas the latter contains an entry for each translator indicating the Suda entries that translator has been assigned. This database uses the same format as the assignment database. Both of these databases are used to reflect the current assignment and completion status of Suda entries assigned to individual translators. These two databases are used as a guide every time a managing editor assigns or reassigns some entries or a translator chooses an entry to translate.
The translation database (trans.dat) is the heart of SOL. All translated entries, as well as subsequent modifications and vettings, are stored in this database. Custom ASCII markups similar to HTML tags such as "<keyword>texts</keyword>" indicate specific fields. Each Suda entry is contained in a paragraph. More precisely, all versions of translations, modifications, and vettings related to this entry are in a single paragraph. This arrangement allows us to fetch all the information related to this entry at once when needed. When an entry is updated, it is first removed from the database and later copied to the end of the file.
Each time an update is made to an entry, a timestamp, a translation sequence number, and a vetting sequence number are attached. The timestamp is used for consistency control to prevent simultaneous updates, as mentioned earlier.
Inside every entry, the tags <trans_en> </trans_en> mark each version of translation or modification (called a "translation sub-entry"), and the tags <vet_en> </vet_en> mark each version of vetting (called a "vetting sub-entry").
In a translation sub-entry, the translation sequence number indicates how many times this entry has been modified, and the vetting sequence number tells us the latest vetting number that is based on this version of translation. The vetting sequence number is initially set to 0 to indicate this version is not vetted. Subsequent vettings based on this version update this vetting sequence number.
In a vetting sub-entry, the translation sequence number tells which version of the translation the vetting is based on, and the vetting sequence number tells how many times in total this entry has been vetted. With these mechanisms, we can trace the relationship between different versions of vettings and translations.
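The interplay of the two sequence numbers can be sketched as follows, in Python rather than the original Perl. The class and method names are hypothetical. The sketch also assumes that the initial translation carries translation sequence number 0 (zero modifications so far), which the description above does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class SubEntry:
    kind: str        # "trans" for a translation sub-entry, "vet" for a vetting one
    trans_seq: int   # translation sequence number
    vet_seq: int     # vetting sequence number
    text: str


@dataclass
class Entry:
    versions: list = field(default_factory=list)

    def _latest_translation(self):
        return [v for v in self.versions if v.kind == "trans"][-1]

    def translate(self, text):
        # Assumed numbering: the first translation has been modified 0 times.
        self.versions.append(SubEntry("trans", 0, 0, text))

    def modify(self, text):
        prev = self._latest_translation()
        # A new translation version starts out unvetted (vet_seq 0).
        self.versions.append(SubEntry("trans", prev.trans_seq + 1, 0, text))

    def vet(self, text):
        base = self._latest_translation()
        total = sum(1 for v in self.versions if v.kind == "vet") + 1
        base.vet_seq = total   # latest vetting based on this translation version
        self.versions.append(SubEntry("vet", base.trans_seq, total, text))


entry = Entry()
entry.translate("draft")
entry.vet("first vetting")
entry.vet("second vetting")
entry.modify("revised by translator")
entry.vet("third vetting")
history = [(v.kind, v.trans_seq, v.vet_seq) for v in entry.versions]
```

Reading `history` from the end shows that the third vetting is based on translation version 1, while the first two vettings trace back to version 0.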
The status-bitmap files store the current statuses of all the Suda entries. There are as many status-bitmap files as there are letters in the Greek alphabet. These files are basically flat files in ASCII with each byte representing a Suda entry. To illustrate, the fourth byte in the file phi_status.bitmap represents the status of the fourth Suda entry starting with the first letter "phi", which is "phi:4". My implementation does not cover special entries such as "alpha:1a".
The activity log (activity_log), as its name implies, records activities. Each entry in this file consists of a date and time, the type of activity ("DATA" indicates activity related to entry assignments, translations, or vettings; "USER" indicates activity related to user-registration processing), the database involved ("REAL" for the permanent database; "TEST" for the trial database), the activity, and the participants involved.
There are several code-conversion utilities used to convert the raw Betacode text from the electronic version of Suda into appropriate output format such as Sgreek, SMK GreekKeys and Ismini. In addition to the conversion utilities, other utility scripts include those that provide common functionality used in this project. For example, Util_main.pl parses the form variables and prepares the header for HTML form output, Util_auth.pl authenticates activities requested by participants, and Util_lock implements the file-locking mechanism used throughout the entire SOL software.
Since there is no easy way to implement global constants in Perl (at least not without the use of Perl modules), several Constants_ files are used as a workaround.
SOL does not implement a high level of security. In the participation database, passwords are stored in cleartext. Every time a participant logs in, the participant's password is passed in cleartext through the network. After the initial login, SOL uses a 32-byte identifier (an MD5 digest written in hexadecimal) in place of the participant's password and passes it as a hidden form variable. Unfortunately, this identifier is also passed in cleartext. Therefore, a malicious user who is able to obtain this identifier along with the URL, and who can guess a participant id, can potentially fool SOL into taking undesirable actions.
A secure connection to a secure server can be an alternative, but the overhead certainly outweighs the benefits in this case. A better approach would be to use time-limited cookies.
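The two mechanisms discussed above, an MD5-derived identifier and a time-limited cookie, can be sketched as follows. This is a Python illustration, not the SOL code: what SOL actually feeds into MD5 is not stated here, so hashing the participant id together with a random nonce is an assumption, and the cookie name "sol_session" and the 30-minute lifetime are hypothetical.

```python
import hashlib
import secrets
from http.cookies import SimpleCookie


def make_session_id(participant_id, nonce=None):
    """Return a 32-character hexadecimal MD5 digest to use after login.

    Assumption: the digest covers the participant id plus a random nonce.
    """
    if nonce is None:
        nonce = secrets.token_hex(16)
    material = "%s:%s" % (participant_id, nonce)
    return hashlib.md5(material.encode("utf-8")).hexdigest()


def session_cookie_header(session_id, lifetime_seconds=1800):
    """Build a Set-Cookie header whose cookie expires after a fixed lifetime."""
    cookie = SimpleCookie()
    cookie["sol_session"] = session_id                   # hypothetical cookie name
    cookie["sol_session"]["max-age"] = lifetime_seconds  # browser discards it afterwards
    cookie["sol_session"]["path"] = "/"
    return cookie.output(header="Set-Cookie:")


sid = make_session_id("translator42")
print(len(sid))
print(session_cookie_header(sid))
```

A time limit only shortens the window in which a captured identifier is useful. Over an unencrypted connection the cookie can still be read by an eavesdropper, just like the hidden form variable.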
There are always limitations and room for improvement. Some of the limitations in SOL come from the design decisions, some from technical difficulties, and others are due to time constraints. The following illustrates these limitations and discusses some improvements.
The translation database is implemented as one big flat file in my implementation. One suggestion is to implement the translation database using smaller files, for example, one file per Greek letter. For many operations, such as inserting translations, modifications, and vettings, and searching with criteria confined to an individual letter (for example, a search on entries starting with the letter "gamma"), the smaller-files approach is desirable, because these operations can be performed faster on smaller files. Another advantage of the smaller-files approach is reduced resource contention, because operations that involve entries of different letters can access the corresponding files simultaneously. However, the smaller-files approach has a major drawback in global searching: to do a global search, the scripts involved have to access all of these smaller files. This kind of operation can be very expensive since, each time it is performed, as many as 28 file-open operations (because there are 28 letters) are required.
All the databases used for this project are in ASCII-markup format. The markup does use SGML tags. However, a Document Type Definition (DTD) defining how the markup tags should be interpreted is missing.
There are several ways to implement the file-locking mechanisms. For example, one way is to use Perl's flock (which calls the system flock(2)) or fcntl (which calls the system fcntl(2)) to lock and unlock a file. This approach, however, requires that the owner of the cgi-bin scripts directory be the same as the uid under which the cgi-bin scripts run. Another approach is to use a temporary file treated as a lock on another file. In my implementation, since the uid of the cgi-bin scripts is "nobody" but the cgi-bin directory is owned by "heng00", the SOL scripts use the second approach (see note 5). The lock files are kept in a publicly accessible directory. Lock files have the problem of hanging around if there is a script failure, blocking other operations until we manually remove them. The latest version ignores a lock file that is older than a minute or so.
Each time the script locks a file (database), it creates a corresponding temporary file with a predefined name that serves as a lock on that file. This temporary file contains the process id. When the script unlocks the file, the temporary file is removed. There are no shared locks; all locks are exclusive.
SOL scripts maintain the consistency of the database by requesting the file locks in a sequential order, and hence all operations are serialized. The lock() function, which takes care of locking a file, waits up to a predetermined amount of time, for example 5 seconds, before reporting that it has failed to obtain a lock (see note 6).
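The lock-file scheme described above can be sketched as follows. This is a Python illustration of the general technique, not a transcription of Util_lock. The ".lock" suffix, the function names, and the default values are assumptions; the 5-second wait and the roughly one-minute staleness threshold are the figures mentioned in the text.

```python
import os
import time


def lock_path(target):
    """Name of the lock file guarding 'target' (the suffix is an assumption)."""
    return target + ".lock"


def is_stale(path, stale_after):
    """True if the lock file exists and is older than stale_after seconds."""
    try:
        return time.time() - os.path.getmtime(path) > stale_after
    except FileNotFoundError:
        return False                  # the holder released it in the meantime


def remove_quietly(path):
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def lock(target, timeout=5.0, poll=0.1, stale_after=60.0):
    """Try to take an exclusive lock; return False if timeout seconds pass."""
    path = lock_path(target)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if is_stale(path, stale_after):
                remove_quietly(path)  # left behind by a crashed script
                continue
        else:
            with os.fdopen(fd, "w") as f:
                f.write(str(os.getpid()))   # record the owner's process id
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def unlock(target):
    """Release the lock by removing the lock file."""
    remove_quietly(lock_path(target))
```

A caller would wrap each database update in lock(...) and unlock(...), and give up with an error message when lock returns False. Note that removing a stale lock is itself not atomic: two scripts could both decide the same lock is stale, which is one reason a system-level flock is preferable where file ownership allows it.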
This project uses CGI scripts heavily. There are many cases in which form-entry validation could be performed faster on the client side using JavaScript or VBScript. For example, using JavaScript on the client side to find out whether a translator is trying to submit a translation form with important fields missing is much faster than sending the information over the network and then invoking CGI scripts on the server side to do the same checking. But the initial goal was to make the project as widely usable as possible, and using any of these scripting languages increases dependency on browser support. Nonetheless, this assumption may no longer hold as time passes.
Two additional flags in each participant's record are not used. They are the "removable" flag and the "status" flag. The "removable" flag is intended to indicate whether a participant can be safely removed from the database. If a participant has not participated in any activity that results in data commitment, this record can be safely removed. The "status" flag is intended to allow managing editors to activate or deactivate a particular participant. Modifying the script "Util_auth.pl" to make use of these two flags enables these features.
A graphical tool is in progress that is intended to create image maps representing different Suda entries, with color codes indicating the status of each individual entry. This tool generates GIF images and creates image maps in HTML pages that correspond to different Suda entries. A click on a region of the map invokes a specific action depending on the context of the operation (for example, editing that entry or translating the selected entry). Several scripts I have written make up this tool. However, due to time constraints, I was unable to fully implement this tool and hence decided to drop the image-map feature.
There is no transaction control in SOL. Many operations are not atomic; failures can happen in the midst of a series of operations. This is a problem because certain operations involve several changes before they are considered complete. For instance, when a translation is submitted, both the completion database and the related status-bitmap file need to be updated. If the script fails after updating only one of them, it leaves the databases in an inconsistent state. One suggestion to avoid this situation is to introduce a transaction log and simulate rollback and commit operations, as most DBMSs do, to guarantee the atomicity of operations and the consistency of the databases.
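One simple way to approximate the suggested transaction log is a redo-style journal: the intended changes are written to a journal file first, then applied, and the journal is removed only after every change has been made. This rolls an interrupted operation forward instead of rolling it back. The Python sketch below illustrates this general technique and is not part of SOL. All names are hypothetical, and it rewrites whole files, which would be costly for a database as large as the translation file.

```python
import json
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of 'path' in one step via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)             # rename over the old file


def apply_updates(updates):
    """Write every file named in the {path: new_content} mapping."""
    for path, data in updates.items():
        atomic_write(path, data)


def commit(journal_path, updates):
    """Apply several file updates so that a crash cannot leave only some done."""
    atomic_write(journal_path, json.dumps(updates))   # 1. record the intent
    apply_updates(updates)                            # 2. make the changes
    os.remove(journal_path)                           # 3. mark completion


def recover(journal_path):
    """Finish an interrupted commit; return True if there was one."""
    if not os.path.exists(journal_path):
        return False
    with open(journal_path) as f:
        updates = json.load(f)
    apply_updates(updates)            # re-applying is harmless (idempotent)
    os.remove(journal_path)
    return True
```

A script would call recover at start-up, while holding the relevant locks, so that a half-finished submission is completed before any new operation begins.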
There is no operation provided to view statistical data in SOL. One desirable display would show when entries were assigned and how long they have been idle (not translated). The data structure (i.e., entry lists) used in the assignment database does not allow us to provide this kind of information. To provide it, a new data structure that keeps a separate record for each Suda entry is required, since each entry can be assigned at a different time.
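A per-entry record of the kind suggested above could look like the following Python sketch. It is only an illustration of the proposed data structure; the field names, the "letter:number" keys, and the participant ids are hypothetical.

```python
from datetime import datetime
from typing import Dict, NamedTuple, Optional


class Assignment(NamedTuple):
    translator: str
    assigned_at: datetime
    translated_at: Optional[datetime] = None   # None while the entry is idle


def idle_days(assignments: Dict[str, Assignment], now: datetime) -> Dict[str, int]:
    """Map each assigned but untranslated entry to the days since assignment."""
    return {
        entry: (now - record.assigned_at).days
        for entry, record in assignments.items()
        if record.translated_at is None
    }


# One record per Suda entry, keyed by "letter:number".
assignments = {
    "phi:4": Assignment("translator7", datetime(1998, 5, 1)),
    "beta:13": Assignment("translator2", datetime(1998, 5, 20), datetime(1998, 5, 25)),
}
print(idle_days(assignments, datetime(1998, 6, 1)))
```

With one record per entry, the assignment time is stored individually, which the current entry lists cannot express.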
The purpose of the code-conversion utilities is to convert texts encoded in Betacode into different Greek formats. David Smith of the Perseus Project at Tufts University developed the Greek-conversion filters for "GreekXlit", "Sgreek", "SMK GreekKeys", and "Ismini". Sean Redmond at New York University developed the Betacode-to-Unicode filter. However, because the Betacode-to-Unicode conversion utility is still a "tool in progress", the Unicode output format is not correctly implemented (see note 7).
Finally, the package can be rewritten in a more object-oriented manner using Perl modules. In addition, many of the web-form processing operations found in my software are already available in the Perl CGI package CGI.pm from CPAN.
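To give a sense of what such a filter does, the following Python sketch converts a small subset of Betacode to Unicode. It is not the filter written by Sean Redmond or David Smith, and all names in it are hypothetical. It handles only the 24 basic letters, the "*" capital marker, the common diacritic codes, and final sigma; font switches, punctuation codes, formatting codes, and Roman passages are outside its scope.

```python
import re
import unicodedata

# Greek lowercase alpha..omega, skipping final sigma (U+03C2).
GREEK = [chr(c) for c in range(0x3B1, 0x3CA) if c != 0x3C2]
# Betacode letters listed in Greek alphabetical order.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", GREEK))
# Betacode diacritic codes mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",    # smooth breathing
    "(": "\u0314",    # rough breathing
    "/": "\u0301",    # acute accent
    "\\": "\u0300",   # grave accent
    "=": "\u0342",    # circumflex
    "|": "\u0345",    # iota subscript
    "+": "\u0308",    # diaeresis
}
SIGMA, FINAL_SIGMA = "\u03c3", "\u03c2"


def beta_to_unicode(text):
    """Convert a simplified subset of Betacode to precomposed Unicode Greek."""
    out, pending, upper = [], "", False
    for ch in text.lower():           # accept upper- or lower-case Betacode
        if ch == "*":
            upper = True              # the next letter is a capital
        elif ch in DIACRITICS:
            if upper:
                pending += DIACRITICS[ch]    # after "*", marks precede the letter
            else:
                out.append(DIACRITICS[ch])   # otherwise they follow it
        elif ch in LETTERS:
            letter = LETTERS[ch].upper() if upper else LETTERS[ch]
            out.append(letter + pending)
            pending, upper = "", False
        else:
            out.append(ch)            # spaces and anything unrecognized
    composed = unicodedata.normalize("NFC", "".join(out))
    # A sigma not followed by another letter becomes final sigma.
    return re.sub(SIGMA + r"(?!\w)", FINAL_SIGMA, composed)


print(beta_to_unicode("MH=NIN"))
print(beta_to_unicode("*)AXILLEU/S"))
```

Emitting combining characters and then normalizing to NFC avoids a large table of precomposed letters, at the cost of relying on the Unicode normalization data that ships with the language runtime.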
1. I wrote the initial SOL software under the supervision of Raphael Finkel, with weekly discussions with Ross Scaife (Professor of Classics at UK). Mukund Chandak is the research assistant currently continuing the programming work on SOL.
2. The web site has changed in the latest version; this document describes the one implemented by the project. The latest URL for SOL registration is http://www.stoa.org/sol/sol_register.shtml.
3. The web site has changed in the latest version; this document describes the one implemented by the project. The new URL for login is http://www.stoa.org/sol.
4. Raphael Finkel has since written a tool in Perl to restore synchrony. It reads the translation database and rebuilds the completion database.
5. In the latest version, the CGI scripts are wrapped so they run with the uid "sol", which also owns all the relevant files and directories. So flock and fcntl would now work.
6. The script has been changed in the latest version to remove stale locks instead. When there are bugs in the cgi-bin script, and it terminates prematurely, it leaves a lock, and that lock can be very annoying unless we automatically remove it when it is clearly stale.
7. In the latest version, the problem has been fixed. It is now possible to convert Betacode to Unicode correctly.
[Adl38] Ada Adler, editor. Suidae Lexicon. Verlag Teubner, Stuttgart, 1928-38. In 5 volumes.
[FSN98] Raphael A. Finkel, Ross Scaife, Huar-En Ng. The SUDA project: Collaborative Web-based Translation. http://www.cs.engr.uky.edu/~raphael/suda.paper/index.html, June 1998.
[PW98] Craig Patchett, Matthew Wright. The CGI/Perl Cookbook. John Wiley & Sons, 1998.
[Red98] Sean Redmond. Personal communication. Betacode-to-Unicode filter. New York University.
[Riv92] R.L. Rivest. RFC 1321: The MD5 Message-Digest Algorithm. Internet Activities Board, April 1992. http://www.faqs.org/rfcs/rfc1321.html.
[Smi98] David Smith. Personal communication. Greek-conversion filters for "GreekXlit", "Sgreek", "SMK GreekKeys", and "Ismini". Perseus Project at Tufts University.
[Tlg] The Thesaurus Linguae Graecae (TLG), University of California, Irvine. The TLG Beta Code Manual. http://www.tlg.uci.edu/~tlg/BetaCode.html.
[WCS96] Larry Wall, Tom Christiansen, and Randal L. Schwartz. Programming Perl, 2nd Edition. O'Reilly & Associates, 1996.