3D Informatica, drawing on more than ten years of collaboration with major companies in the national space and defense sector, offers sophisticated document management and workflow systems specific to the sector. All the software is integrated, and can be integrated with existing solutions through web services. It provides advanced search tools that rely not only on the information retrieval features of Extraway but also on spectral analysis of documents and term-frequency algorithms.
Below is a summary of the modules involved in the Space-Defense solutions.
The following Component Diagram, in UML notation, describes the general architecture of the adopted technological solution.
It identifies the main components of the system and how they interact with one another.
Components for which software development was carried out are shown in red.
The system consists essentially of two main components:
- Extraway® Bridge: monitors RSS/Atom sources, interprets them, stores the related items/entries in the document system, and exports them to other systems.
- Extraway Document Platform: manages the imported RSS/Atom items/entries and provides a graphical interface for consulting them. Modules have also been provided to process this information, for example to remove duplicates and to classify content automatically.
Extraway® Bridge is software that acts as a bridge between systems that produce data/documents and the systems designated to store and manage them. It is a modular system based on the concept of connectors that interact with one another.
The operation of Extraway® Bridge can be described as a sequence of three distinct phases:
- Input: the input connectors analyze the sources of data/documents.
- Elaboration: the data/documents identified in the sources are processed.
- Output: once processing is complete, the data/documents are passed to the output stage, which uses dedicated connectors to interact with the other systems that will store and manage this information.
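The three phases can be pictured with a minimal sketch. It is not the Extraway® Bridge implementation: the function name `run_bridge`, the dictionary record shape, and the idea of modeling connectors as plain callables are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

Document = dict  # hypothetical record shape, not the product's data model


def run_bridge(
    inputs: List[Callable[[], Iterable[Document]]],
    process: Callable[[Document], Document],
    outputs: List[Callable[[Document], None]],
) -> int:
    """Run the input, elaboration and output phases once; return documents delivered."""
    delivered = 0
    for read in inputs:                  # Input: each connector yields documents
        for doc in read():
            processed = process(doc)     # Elaboration: transform the document
            for write in outputs:        # Output: hand it to every destination
                write(processed)
            delivered += 1
    return delivered


# Usage: one in-memory source, a trivial processing step, one in-memory destination.
stored: List[Document] = []
count = run_bridge(
    inputs=[lambda: [{"title": " Launch window confirmed "}]],
    process=lambda d: {**d, "title": d["title"].strip()},
    outputs=[stored.append],
)
```

Modeling connectors as interchangeable callables mirrors the modularity described above: a new source or destination is added without touching the processing step.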
Extraway® Bridge is a web application with a graphical interface for configuring sources and destinations and for monitoring operations.
Without going into the details of the features offered, we want to emphasize the native ability to query sources for manual imports or, by defining specific import types combined with a scheduled agent, to perform automatic imports.
The RSS/Atom connector is a source (input) connector for RSS/Atom feeds that acquires feeds in multiple specified standard formats.
The main Java libraries for reading RSS/Atom feeds were evaluated and analyzed:
- Informa (does not support Atom and RSS 2.0 when producing feed output).
- RSS4J (project practically no longer supported).
- RSSLibJ (this one does not seem to be supported either).
- Rome (handles all existing formats in both input and output).
The choice fell on the Rome library, the state of the art in the Java world.
Using an external library to handle RSS/Atom properly was mandatory. Over time, too many variants and different format standards have followed one another, generating real chaos. Rome correctly handles RSS 0.90, RSS 0.91 Netscape, RSS 0.91 Userland, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3 and Atom 1.0.
It is also modular; one of its modules handles the GeoRSS standard, managing points in the GeoRSS Simple, W3C Geo and GeoRSS GML formats.
The GeoRSS module was adapted because it did not handle the "georss:elev" tag.
Despite the use of the Rome library, several issues related to the availability and quality of the input data had to be addressed during development and testing: the standards do not make some of the main item data mandatory, the standards for date formats are often not respected, and the publishedDate timestamp of items and feeds is often used incorrectly. This made it necessary to handle specific cases.
Once a feed has been processed and the items/entries that compose it have been identified, they are passed to a small module, the RSS/Atom Normalizer.
The normalizer module converts the various RSS and Atom input formats into a single format, performing a true normalization.
This is done by producing a virtual feed at runtime and extracting the XML of each single item/entry so that it can be stored in the document system as an atomic entity. The XML of the single item/entry then enters the (native XML) document system as is and becomes the XML portion of the associated XML record.
In this way, potentially any XML data of a feed item becomes searchable and manageable directly from the document system.
The normalization phase preserves any non-standard tags that were inserted into the item/entry and preserves the GeoRSS tags.
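The idea of normalizing while keeping extension tags can be sketched as follows. This is only an illustration in Python, not the Rome-based normalizer described above: the common element names (`entry`, `published`, `summary`) and the `FIELD_MAP` table are assumptions, and the sketch handles only non-namespaced RSS items and Atom 1.0 entries.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical mapping from source tags to common field names.
FIELD_MAP = {
    "title": "title", ATOM + "title": "title",
    "pubDate": "published", ATOM + "published": "published",
    ATOM + "updated": "updated",
    "description": "summary", ATOM + "summary": "summary",
}


def normalize(feed_xml: str) -> list:
    """Return one <entry> element per item/entry, with unknown tags kept as is."""
    root = ET.fromstring(feed_xml)
    if root.tag == ATOM + "feed":
        items = root.findall(ATOM + "entry")
    else:
        items = list(root.iter("item"))
    records = []
    for item in items:
        record = ET.Element("entry")
        for child in item:
            if child.tag in FIELD_MAP:
                # Known field: rename it to the common name.
                ET.SubElement(record, FIELD_MAP[child.tag]).text = child.text
            else:
                # Extension or non-standard tag (GeoRSS, for example): keep unchanged.
                record.append(child)
        records.append(record)
    return records
```

Each returned element is self-contained, so it can be serialized with `ET.tostring` and stored as an independent record.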
Docway Connector is an output connector for Extraway® Bridge that stores the items/entries of incoming RSS/Atom feeds in the document system.
Each item/entry of an examined feed enters the document system as an independent, atomic, intact and self-consistent record.
The following set of information is recorded for each item:
- publication date
- source (feed url)
- normalized XML (this also ensures that GeoRSS tags are preserved)
- initial automatic categorization of the news by source
During import, the system automatically determines whether it must create a new record or update an item/entry already stored (this also applies to feeds whose version of the standard does not provide for updating items).
In practice, the following edge cases have been identified:
- the same news item changes the timestamp of its publication date.
- the same news item changes its description.
- feeds lacking a publication date.
- Atom feed entries lacking a publication date but with the update date populated.
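The create-or-update decision can be sketched as below. The source does not say how an item is recognized as already stored, so the identity used here (feed URL plus the item's id or link) is purely an assumption, as are the function and field names.

```python
def record_key(source: str, item: dict) -> tuple:
    """Hypothetical identity: feed URL plus the item's id, or its link if no id."""
    return (source, item.get("id") or item.get("link"))


def effective_date(item: dict):
    """Use the publication date, falling back to the update date when it is missing."""
    return item.get("published") or item.get("updated")


def import_item(store: dict, source: str, item: dict) -> str:
    """Store the item and report whether it was created or updated."""
    key = record_key(source, item)
    action = "update" if key in store else "create"
    # A changed timestamp or description overwrites the stored record
    # instead of producing a duplicate.
    store[key] = {**item, "source": source, "date": effective_date(item)}
    return action
```

With an identity that ignores dates and descriptions, a republished item with a new timestamp or a corrected description maps to the same record, and an Atom entry without a publication date still receives a usable date.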
The Export Connector (an output connector for Extraway® Bridge) dumps the stored feed items/entries in XML format to a folder so that they can be picked up by an external system.
For example, this channel can be used to export feeds to the geotag engine, for geotagging items/entries that lack location information.
EXTRAWAY DOCUMENT PLATFORM
The document system consists of three different modules:
- 3diws: SOAP web service for interoperability with external systems
- Docway4: web application for document management
- eXtraway Server: native XML database for data management
3diws comprises the web services natively exposed by the document platform.
Docway4 allows the imported news to be consulted via a web interface.
A special module was developed that implements a pool of RSS/Atom feeds, making the imported news available externally in standard format.
Having different feeds allows news with different update timing needs to be published.
Note that the delay between the import of an item/entry and its publication in the feeds implemented in the system is extremely small, estimated in minutes.
eXtraway Server is a native XML database with advanced information retrieval and electronic records management in accordance with the European MoReq standard.
An extension of eXtraway Server has been developed to enable automatic content classification. As this is a particularly interesting, high-tech component, a whole section is dedicated to it.
When dealing with a particularly large amount of documentary content, being able to classify it, both for the coherence of its organization and to simplify retrieval and usability, changes from an "opportunity" to a "necessity".
If, in addition, the informative value of these documents is greater the more promptly they are categorized, the use of automated tools becomes fundamental.
The goal is obvious: to identify, rapidly and with a low error rate, the classification to assign to a record on the basis of its contents.
HOW TO CLASSIFY
How is classification done? By identifying, within the significant textual component of the records, those terms and concepts that make the record "similar" to a particular classification item.
There are two ways to achieve this, two schools of thought: the semantic approach and the linguistic/statistical approach.
Intuitively, the semantic approach is the one that, at least on paper, can produce the best results, but it suffers from some problems:
- Semantic interpretation algorithms exhibit a dual behavior: when they correctly identify the meanings in a text, the results are usually very valid, but when they do not, the "noise" produced can be enough to nullify the entire process.
- This is all the more true when the texts to be processed are short or written in concise language.
- Semantic systems require rather long processing times, delivering results that may be "late" with respect to needs.
The linguistic/statistical approach is much simpler. All the terms in the texts are reduced to the normalized form of the lemma (for example, bringing all verb forms back to the infinitive), and each is assessed in terms of its relevance.
Note on relevance: in short, a term is more significant the greater its presence in a known subset relative to its distribution outside it. When a term occurs with a given frequency in the subset but is equally frequent outside it, the term is not considered relevant.
Relevance is calculated with processes based on the Tf-Idf function, which skim off all terms that are clearly not representative; the terms obtained are then "crossed" with those identified for each classification item. At the end of this evaluation you have a list of classifications, in order of "weight", that are believed to match the specified record.
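For readers unfamiliar with Tf-Idf, here is a minimal sketch of the textbook formulation (term frequency in the document multiplied by the logarithm of the inverse document frequency). The exact variant used by the classifier is not stated in the text, so the function below and its name are assumptions.

```python
import math
from collections import Counter


def tf_idf(doc_terms: list, corpus: list) -> dict:
    """Weight each term of doc_terms against a corpus given as a list of term lists."""
    counts = Counter(doc_terms)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        # Number of corpus documents that contain the term.
        doc_freq = sum(1 for other in corpus if term in other)
        if doc_freq == 0:
            continue  # term never seen in the corpus: no basis for a weight
        tf = count / len(doc_terms)
        idf = math.log(n_docs / doc_freq)
        weights[term] = tf * idf
    return weights


corpus = [["satellite", "launch", "the"], ["radar", "the"], ["budget", "the"]]
weights = tf_idf(corpus[0], corpus)
```

In this example "the" occurs in every document, so its weight is zero, while "satellite" and "launch" receive equal positive weights. This is the skimming effect described above: terms that are as frequent outside the document as inside it are not considered relevant.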
Based on our experience, gathered in the recent past in collaboration with experienced partners, a purely semantic system cannot be said to be both effective and efficient. The two aspects are inversely proportional, and it is not uncommon for the best results it offers to be obscured by the large amount of "noise" produced on many occasions and by the processing time required.
In summary, the best solution would be a "mixed" one, in which each evaluation method contributes to identifying the most appropriate classification. With careful tuning it would be possible to get "the best of all possible outcomes". Yet this approach would result in exponential growth in architectural complexity and a further deterioration in performance.
The problem of reducing the "noise effect" produced by the semantic approach would also remain; this can be achieved by giving more credit to the linguistic detection. At the same time, this devalues the semantic contribution, because in removing many "false positives" we risk losing a "true positive" that only the semantic system was able to bring out.
Mixed systems find application in scenarios moderated by a human being, who examines the results, "moderates" them, or otherwise makes choices based also, and above all, on experience.
For these reasons, 3D Informatica's choice, in particular for the project in question, is to adopt the linguistic/statistical approach alone.
SHORT DESCRIPTION OF THE PROCESS
We begin by saying that the process must first "learn".
To do this, you submit to the system a set of records whose nature you already know, that is, records that are classified (or classifiable) under one of the established classification items.
The automatic classifier, in "learning" mode, identifies the most representative terms of this subset of records, using processes based on the Tf-Idf function. In this way it creates a pairing between these terms (and their calculated "weight") and the corresponding classification item. It thus creates what is called a target, containing all the identified terms together with an evaluation of the importance each term assumes within the subset of records represented by that target. The target can be enriched with additional terms entered directly by experts who, knowing the topic, can in this way enrich the pool of significant terms.
This process must of course be carried out for every classification item, creating as many targets as there are eligible items.
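The learning step can be sketched as follows. This is a simplified illustration rather than the classifier's actual algorithm: the function name `build_targets`, the `top_k` cut-off, and the weighting (term frequency within the subset multiplied by a textbook inverse document frequency over all training records) are assumptions.

```python
import math
from collections import Counter


def build_targets(labelled: list, expert_terms: dict = None, top_k: int = 20) -> dict:
    """Build one target (term -> weight) per classification item.

    labelled is a list of (classification_item, list_of_terms) pairs.
    """
    n_docs = len(labelled)
    doc_freq = Counter()
    for _, terms in labelled:
        doc_freq.update(set(terms))          # in how many records each term occurs
    per_item = {}
    for item, terms in labelled:
        per_item.setdefault(item, Counter()).update(terms)
    targets = {}
    for item, counts in per_item.items():
        total = sum(counts.values())
        weights = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        # Terms spread evenly over every record get weight 0 and are dropped.
        targets[item] = {term: weight for term, weight in best if weight > 0}
    # Terms entered by domain experts are merged in with their given weight.
    for item, extra in (expert_terms or {}).items():
        targets.setdefault(item, {}).update(extra)
    return targets


labelled = [
    ("space", ["satellite", "orbit", "the"]),
    ("space", ["satellite", "launch", "the"]),
    ("defense", ["radar", "the"]),
]
targets = build_targets(labelled, expert_terms={"defense": {"missile": 1.0}})
```

Here "the" appears in every training record and is excluded from both targets, while the expert-supplied term "missile" is added to the "defense" target even though it never occurs in the training records.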
A similar process applies to the classification of new records. The textual content of the record to be classified is again processed using the Tf-Idf function, identifying the most relevant terms.
Once these terms have been identified, they are compared with those belonging to the different targets, taking into account their weight in each, and a list of eligible targets is drawn up.
Here the scenarios differ considerably, depending in particular on the application. Some solutions admit one and only one classification, and this must be assigned carefully, especially when the difference in weight between the relevant targets is small. Others allow several classification items for a single record, which makes an appropriate assignment easier, especially when several items are all relevant with similar weights.
In interactive systems the operator is given the opportunity to "accept" or reject the proposed classification. The tool is used as a guide rather than as a stand-alone tool. Nevertheless, once the system has been "run in", it is possible to determine the weight thresholds beyond which the classification is accepted automatically.
Fully automatic systems must choose the best classification (or classifications), favoring those with a high score and discarding all those with a low score or, in any case, a score below a configured threshold. The higher the calculated weight, the greater the reliability of this choice.
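Ranking and threshold-based selection can be sketched as below. The scoring rule (sum of the products of record weight and target weight over shared terms) and the names `rank_targets`, `select`, `min_score` and `max_items` are assumptions chosen for illustration; the source does not specify how weights are combined.

```python
def rank_targets(record_weights: dict, targets: dict) -> list:
    """Return (classification_item, score) pairs sorted by decreasing score."""
    scores = {}
    for item, target in targets.items():
        # Only terms shared by the record and the target contribute.
        score = sum(w * target[t] for t, w in record_weights.items() if t in target)
        if score > 0:
            scores[item] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select(ranked: list, min_score: float, max_items: int = 1) -> list:
    """Keep at most max_items classifications whose score reaches the threshold."""
    return [item for item, score in ranked if score >= min_score][:max_items]


targets = {
    "space": {"satellite": 0.5, "orbit": 0.4},
    "defense": {"radar": 0.6, "satellite": 0.1},
}
record = {"satellite": 0.3, "orbit": 0.2}
ranked = rank_targets(record, targets)
```

With `max_items=1` the sketch behaves like a solution that admits a single classification; raising it models solutions that allow several classification items per record. An interactive system would show `ranked` to the operator, while a fully automatic one would apply `select` directly.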
CONFIGURATION, TUNING AND SELF-LEARNING
We conclude by discussing configuration and tuning.
As we have seen, several parameters must be set, such as the minimum and maximum weight, the maximum number of classifications that can be assigned to a single record, the weight-boost coefficients for terms considered more important, the blacklist of terms to ignore, and so on.
Remember that this system is based on a learning process. Bringing the system to life with the initial training and then failing to perform constant "maintenance" will make it a not very useful tool in the medium term, if not the short term. That is why "self-learning" and its modes should be considered: clearly, after an initial phase in which the system must be taught to recognize records and their classifications, it must continue to "grow" as new records flow into the system.
Here are some examples of how this process can take place:
- One-off, at set time intervals. The evaluation of all targets is run from scratch, repeating the relevance assessments on the records that were part of the initial subset plus all those added over time (possibly including only those of greater weight).
- Incrementally. As records are added to the database, once classified, they contribute immediately to recalculating the relevance values of each target. This behavior may be restricted to records whose weight exceeds a configured threshold.
Choosing between these systems is not trivial, because much depends on the content of the records themselves.
In general, the first approach offers a more correct result, but on the other side of the scale, each new contribution does not immediately become part of the system and enrich it.
The second approach is much more immediate and, with a lower computational burden, allows each individual target to be modeled as the database is populated. On the other hand, each new document also influences all the other targets, both positively and negatively, without this being detected.
To give an example: over time, a number of records equal to 10% of the original content is added to a large database. With the second approach, the most used classifications "correct" their own settings at the expense of the less used ones, whose terms might gain importance if the assessment were redone. In practice, you risk the less frequent classifications becoming less and less frequent. On the other hand, the first approach requires a much higher computational burden, and it is not reasonable to apply it to each new record inserted into the database, especially if the archive is populated frequently.
Once again, the best choice is probably the intermediate one: recalculate the relevance values of a target when the selected record has obtained a very significant score, and then perform a total recalculation for all targets at intervals that are not too long.
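The intermediate policy can be expressed as a small decision function. The thresholds, parameter names and return values below are hypothetical and chosen only to make the rule concrete.

```python
def next_recalculation(
    score: float,
    hours_since_full: float,
    high_score: float = 0.8,           # hypothetical "very significant" threshold
    full_interval_hours: float = 24.0  # hypothetical interval between full runs
) -> str:
    """Decide which recalculation to run after a record has been classified."""
    if hours_since_full >= full_interval_hours:
        return "full"      # periodic recalculation of all targets from scratch
    if score >= high_score:
        return "target"    # immediate update of the selected target only
    return "none"          # wait for the next full recalculation
```

Both values would need tuning on real data: a lower `high_score` moves the behavior toward the incremental approach, and a shorter `full_interval_hours` moves it toward the periodic one.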
When recalculating the weight of each target, chronological order can also be taken into account. We know that language evolves, and what was written more than 10 years ago may not be very significant today and may therefore distort the statistical evaluations performed by the classifier. Regardless of the composition of the subset corresponding to a target, it can be established, for example, that only the records of the last 12 months are considered for weight evaluation.
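A chronological filter of this kind is simple to sketch. The `published` field name and the 365-day approximation of "12 months" are assumptions.

```python
from datetime import date, timedelta


def recent_records(records: list, today: date, window_days: int = 365) -> list:
    """Keep only the records published within the last window_days days."""
    cutoff = today - timedelta(days=window_days)
    return [r for r in records if r["published"] >= cutoff]


records = [
    {"id": 1, "published": date(2014, 1, 1)},
    {"id": 2, "published": date(2024, 3, 1)},
]
recent = recent_records(records, today=date(2024, 6, 1))
```

Only the filtered list would then be passed to the weight evaluation, so older vocabulary stops influencing the targets without the records themselves being removed from the archive.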
An automatic classification system is somewhat similar to a living organism, a tree. It feeds and grows, and as this growth takes place it may need to be checked: the trunk must be straight, there must be no branches or leaves that show signs of not receiving nourishment, and the entire plant must be harmonious and balanced.
The tree cannot be left to itself. It must be monitored and, if necessary, tended so that it always gives good results.