Enterprise Content Management

Armedia Blog

Mule and Activiti: Happy Together

March 27th, 2014 by David Miller

Activiti supports Mule tasks out of the box: when Activiti encounters a Mule task, it launches the configured Mule flow. This integration is exactly why my new project uses Mule in the first place. But Activiti also supports execution and task listeners: code that runs in response to an Activiti event. How can I make Activiti launch a Mule flow in response to such events?

The short answer is: write a generic Spring bean to launch a Mule flow; then use Activiti’s expression language to call this bean. Example snippet from an Activiti XML configuration file:

<sequenceFlow id="flow1" sourceRef="theStart" targetRef="writeReportTask">
    <extensionElements>
        <activiti:executionListener
            expression="${cmMuleActivitiListener.sendMessage('vm://activiti-flow', 'Allman Bros. Band', 'bandResults', execution)}"/>
    </extensionElements>
</sequenceFlow>

 

This snippet causes Activiti to launch the vm://activiti-flow Mule flow when the Activiti workflow transitions from theStart to writeReportTask. "Allman Bros. Band" is the message payload, and the Mule flow results are placed in the bandResults Activiti flow variable. execution is a reference to the currently executing flow, and allows the Spring bean cmMuleActivitiListener to give the Mule flow access to the Activiti flow variables.
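
Note that the expression passes four arguments, while the primary Java method shown later in this post also takes the MuleClient as its first parameter. Presumably a public entry point bridges the two; a minimal sketch (an assumption – the full class is not shown here, and the getter name is inferred from the realMuleClient property configured below) might look like this:

// Hypothetical public overload invoked by the Activiti expression above; it
// supplies the injected MuleClient and delegates to the private
// sendMessage(MuleClient, ...) method shown later in this post.
public void sendMessage(String url, Object payload, String resultVariable, DelegateExecution execution)
        throws MuleException
{
    sendMessage(getRealMuleClient(), url, payload, resultVariable, execution);
}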

The Spring configuration file is very simple:

<?xml version="1.0" encoding="UTF-8"?>  
<beans xmlns="http://www.springframework.org/schema/beans"  
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.1.xsd">

    <description>
        This Spring configuration file must be loaded into an
        application context which includes a bean named 
        "cm-mule-client".  This cm-mule-client bean must be an
        org.mule.module.client.MuleClient.
    </description>

    <!-- NOTE: bean ids to be used in Activiti expression language must be proper Java identifiers -->
    <bean id="cmMuleActivitiListener" class="com.armedia.acm.activiti.ActivitiMuleListener">
        <property name="realMuleClient" ref="cm-mule-client"/>
    </bean>
</beans>

Please note the description. The project must provide a Spring bean named cm-mule-client.

Finally, the primary method in the Spring Java class is also pretty straightforward:

    private void sendMessage(MuleClient mc, String url, Object payload, String resultVariable, DelegateExecution de)
            throws MuleException
    {
        // Expose the interesting Activiti execution details as Mule message properties
        Map<String, Object> messageProps = new HashMap<String, Object>();
        setPropertyUnlessNull(messageProps, "activityId", de.getCurrentActivityId());
        setPropertyUnlessNull(messageProps, "activityName", de.getCurrentActivityName());
        setPropertyUnlessNull(messageProps, "eventName", de.getEventName());
        setPropertyUnlessNull(messageProps, "id", de.getId());
        setPropertyUnlessNull(messageProps, "parentId", de.getParentId());
        setPropertyUnlessNull(messageProps, "processBusinessKey", de.getProcessBusinessKey());
        setPropertyUnlessNull(messageProps, "processDefinitionId", de.getProcessDefinitionId());
        setPropertyUnlessNull(messageProps, "processVariablesMap", de.getVariables());
        setPropertyUnlessNull(messageProps, "processInstanceId", de.getProcessInstanceId());

        // send() blocks until the Mule flow replies
        MuleMessage mm = mc.send(url, payload, messageProps);

        log.debug("Mule sent back: " + mm.getPayload());

        if ( resultVariable != null )
        {
            de.setVariable(resultVariable, mm.getPayload());
        }
    }

    private void setPropertyUnlessNull(Map<String, Object> map, String key, Object value)
    {
        if ( value != null )
        {
            map.put(key, value);
        }
    }

 

This code sends the Mule message and waits for Mule to reply (since we call the Mule client send method). To post the message asynchronously, we just need a corresponding method that calls the Mule client dispatch method.
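
A fire-and-forget counterpart might look something like the sketch below. It is not part of the class as shown above; it simply mirrors sendMessage but calls dispatch, so there is no reply payload to store in a flow variable.

    // Sketch of an asynchronous counterpart: dispatch() posts the message and
    // returns immediately, so there is no MuleMessage reply and no result
    // variable to set on the execution.
    private void dispatchMessage(MuleClient mc, String url, Object payload, DelegateExecution de)
            throws MuleException
    {
        Map<String, Object> messageProps = new HashMap<String, Object>();
        setPropertyUnlessNull(messageProps, "processInstanceId", de.getProcessInstanceId());
        setPropertyUnlessNull(messageProps, "processVariablesMap", de.getVariables());
        // ... populate the remaining properties exactly as sendMessage does ...

        mc.dispatch(url, payload, messageProps);
    }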

This small module allows me to configure Activiti to invoke Mule flows in response to Activiti events. And Activiti ships with the ability to invoke Mule flows as automated tasks. So now I have pervasive Mule support in Activiti: wherever Activiti supports executable code, it can call Mule. So happy together!

Documentum Storage Decision Points – SAN versus NAS

March 25th, 2014 by Bill Hunton

When it comes to in-house IT services, every company has different practices.  Most of us stick to the basics, unless our company’s core product is cutting-edge hardware or software.  Usually businesses follow the pack, and their system architects color within the lines.

For several years there has been an ongoing discussion around SAN versus NAS storage.  What is the best solution?  The simple non-answer is “It depends.”   There are several things to consider when designing the storage architecture to support a solution, or when working with the customer’s storage engineer to design it.

The decision points are common among all software solutions.  However, the software products themselves may move you toward a particular hardware and storage solution.  Documentum is no different than Alfresco, and it is no different than FileNet and others regarding the questions to ask.

There may be different storage requirements for products within a vendor’s suite, and the solution “stack” may dictate the storage solution for all the products deployed in a system.

Regarding Documentum, is there a “best solution”?   We will address concerns, provide questions to ask, and present a couple of recommendations.

As a friend of mine once said, “Indecision is the key to flexibility,” but eventually you have to order hardware.

NAS versus SAN

We will not go deeply into SAN and NAS storage definitions.  However, here is a very simplified view of NAS and SAN configurations:

[Figure: basic NAS storage configuration]

[Figure: basic SAN storage configuration]

NAS provides both storage and a file server.  The NAS unit has its own operating system.  The application host, in our case Documentum, communicates with the NAS over the network using file-based protocols such as NFS and CIFS to read and write the content located on the file system.  To the host operating system, NAS appears as a file server that provides drive and share mappings.   An example is EMC Celerra.

SAN storage is also networked: the storage is attached to a SAN switch, which in turn connects to the client hosts.  Blocks of storage appear the same as local disks; a SAN volume looks like a disk to the operating system, and management utilities such as Veritas make it accessible.  SAN protocols include Fibre Channel, iSCSI, and ATA over Ethernet (AoE).  Examples are EMC Symmetrix and CLARiiON.

Storage protocols can significantly affect price and performance.  For example, an iSCSI SAN may cost several times as much as an ATA-over-Ethernet setup.  So why pick iSCSI?  Well, iSCSI is much faster and generally more reliable than AoE.  Fibre Channel is faster still; however, iSCSI uses less expensive Ethernet switches and cables, whereas Fibre Channel requires more specialized components.  What do you need to meet your requirements – operating distance, support staff skills, available budget?

In years past, the arguments about SAN versus NAS were more dichotomous than today.  The cost, features, and performance are different, but with hybrid configurations we can get efficiency and performance.  Differences still exist if you choose one versus the other.

Generalizations are dangerous because there are exceptions; protocols change rapidly, blur the lines, and turn today’s best recommendation into tomorrow’s dog.   So, anticipating comments to the contrary, here ya go:  consider SAN to be faster, and consider NAS lower cost and easier to maintain.

What about CAS (Content Addressable Storage)?  CAS is used mainly for archiving content, especially large amounts of content. The storage unit contains a CPU.  The storage address for each content file is computed based on an algorithm using the actual characters in the file.   CAS offers security features and supports retention policies.  We’ll discuss CAS at another time.  However, there are NAS and even SAN solutions that can be used in concert with Documentum products for archiving and retention policy compliance.

What Works Well in Documentum

Let’s think about what Documentum is and does, what your business requirements are, and why you might choose SAN over NAS and vice versa.

The rule of thumb is that file I/O is the critical path for any single processing thread.  We do need to consider all kinds of performance and latency issues – remote users accessing centralized repositories, network bandwidth, file transfer protocols, peer-to-peer protocol layers, application design, host resources, and so on – and any single component out of whack will degrade performance.  But if all else is well, reading and writing data to and from disk, with the necessary transfer of packets across the wires, is the single most time-critical process, and file I/O performance will impact speed directly.

Content Server

What does Documentum do?   For a moment, let’s forget the EMC IIG product suites and stacks.  The basic, over-simplified answer is that Documentum manages content files in various formats and collects and stores data about each one.  That means it must, first, transmit data over the wires to and from a relational database and, second, transmit files over the wires to and from storage.

Generally speaking, Documentum can use either SAN or NAS storage for content files.  NAS may be the best choice if you require sharing by multiple content servers.

Full Text Index

One of the features of Documentum is full text indexing.  Documentum specifically recommended against NAS with its old Verity and FAST full text indexing integrations.  The reason is that the content server is already handling the basic file and data communications, and NAS puts additional load on the entire process and negatively impacts throughput as content is copied and the index is created.  Even now, full text indexing is run on a separate dedicated host.  With both FAST and xPlore, you can search on metadata in the full text index, as well as for specific text.  The underlying data is in XML.

Even with great improvements in NAS performance, we would recommend using SAN storage with full text indexing.

xCP

Documentum xCP is a suite of products and a platform for designing and building process-driven applications.  There is considerable file activity between the product components and the database.  We would recommend SAN.

Database

Database performance drives Documentum performance.  Query design and tuning in custom applications, as well as managing indexes, query plans, and statistics with the out-of-the-box product, are mandatory for good performance.   SAN versus NAS is an important conversation; for what it’s worth, one of our largest clients uses SAN with Oracle.

Oracle, EMC, and others continue improving storage designs.  For example, Oracle now recommends its Direct NFS (dNFS) client with release 11g: NFS support is integrated into the database itself, so Oracle accesses NFS storage directly rather than going through the host operating system.

Decision Criteria

What are some questions to ask, and things to consider?

Cost

The objective is to get maximum performance at minimum total cost.  Here are a few things to consider besides unit cost:

  • Existing contract with the storage vendor
  • Licensing fees
  • Internal  versus vendor support
  • Discounts on storage and software bundles (Hmm.  EMC has storage solutions as well as Documentum.)
  • External versus internal Cloud storage solutions
  • Tiered storage solution based on retention policy – lower cost slower storage for archiving?
  • Do you want to add CAS to the mix or apply the storage you know to your Documentum archiving solution?

Performance

Do you really need sub-second response times from your content management system?   Performance is in the eye of the beholder, usually the end user.  Consider design recommendations unrelated to hardware:

  • Separate your CMS from content presentation.
  • Separate your content authoring from publishing channels, physically or by process.
  • Pre-publish content that is to be consumed.
  • Archive “old” content to reduce query and response times.
  • Use Documentum custom types with different default storage locations.
  • Use distributed stores to bring content physically closer to the consumer.

Summary

“It depends” is the operative phrase when deciding what kind of storage you want to purchase for your Documentum system, or any other application.  Different Documentum products have unique storage considerations.   When designing your system, consider costs other than the direct storage price and build efficiencies into the architecture from the ground up.

SAN versus NAS is still a valuable discussion to have in spite of rapid improvements in technology.  They continue to converge.  Hybrid systems offer performance and cost savings.  Be careful of Documentum product requirements, but also use Documentum features to take advantage of storage technology and savings.

Armedia Case Management: Pluggable Authentication Modules with Spring

March 20th, 2014 by David Miller

Over the past several months, I have written several blog posts about my project, the Armedia Case Management framework. In this post, I will cover how we have gotten Armedia Case Management to work with Spring Security. It is important to note that each organization has its own user directory, its own authentication rules, its own way of doing things.  Spring Security supports every conceivable scheme, so starting with Spring Security is a no-brainer.  But how can my project support configuring Spring Security at runtime?

How Other Products Do It

Alfresco supports authentication configuration at initial deployment time [1].  The administrator configures properties files to support one or more authentication chains [2].  Alfresco reads these properties files at startup time.  When users log in, Alfresco tries to authenticate against each link in the authentication chain.  Alfresco ships with support for NTLM [3], passthru [4], LDAP [5] (both non-Active Directory and Active Directory), Kerberos [6], and external authentication.  This approach has obviously worked well, given how widely used Alfresco is, but it leaves some of Spring Security’s reach and flexibility behind.

JIRA supports authentication configuration at runtime.  After initial installation, JIRA has one admin user.  That user logs in and uses the web user interface to configure one or more user directories [7].  The directory support is more limited: JIRA internal directory [8], Active Directory or other LDAP servers [9], Atlassian Crowd or another JIRA server [10].  As with Alfresco you can stack up as many directories as you need, even multiple Active Directory configurations.  This approach must also be very good since JIRA may be more widely used than Alfresco.  But it still leaves some of Spring Security behind.

Armedia Case Management Authentication Goals

For Armedia Case Management, I want to be able to support almost any authentication scheme out of the box; again, that’s why we’re using Spring Security.  But I don’t want to specify a new properties file scheme, like Alfresco, or write a web user interface, like JIRA.  Such work is very exacting, requires a backbreaking regimen of testing, and never quite gets you all the way to complete Spring Security coverage.

Not that Alfresco’s and JIRA’s approaches are wrong; I’ll be very happy when Armedia Case Management achieves the tiniest fraction of their success.  Still, I want to try something different.

Armedia Case Management should be able to read Spring Security configurations at runtime, passing authentication requests to each such Spring Security configuration.  Once such a configuration approves the authentication request, the user is logged in.  If they all reject the request, the user sees the error message from the last configuration.

How It Works

My solution does have a few moving pieces.

  1. A folder watcher that raises events when Spring configuration files are added, updated, or removed in the system configuration folder
  2. A Spring context manager that adds, updates, or removes Spring child contexts based on the content of such Spring config files
  3. A custom AuthenticationManager [11] that iterates over a set of AuthenticationProviders [12], as described above.  It gets the authentication providers by asking the Spring context holder for all AuthenticationProvider instances.  (A minimal sketch of this piece follows this list.)
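
Here is a minimal sketch of piece 3. The class name and the SpringContextHolder helper are illustrative assumptions rather than the actual Armedia code; the point is only to show the iteration and the last-error-wins behavior described above.

import java.util.Map;

import org.springframework.security.authentication.AuthenticationManager;
import org.springframework.security.authentication.AuthenticationProvider;
import org.springframework.security.authentication.ProviderNotFoundException;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.AuthenticationException;

// hypothetical view of the context manager from piece 2
interface SpringContextHolder
{
    <T> Map<String, T> getAllBeansOfType(Class<T> type);
}

public class AcmAuthenticationManager implements AuthenticationManager
{
    private SpringContextHolder springContextHolder;

    @Override
    public Authentication authenticate(Authentication authentication)
    {
        AuthenticationException lastException = null;

        // ask the context holder for every AuthenticationProvider defined in any child context
        Map<String, AuthenticationProvider> providers =
                springContextHolder.getAllBeansOfType(AuthenticationProvider.class);

        for ( AuthenticationProvider provider : providers.values() )
        {
            try
            {
                Authentication result = provider.authenticate(authentication);
                if ( result != null && result.isAuthenticated() )
                {
                    // the first configuration to approve the request wins
                    return result;
                }
            }
            catch (AuthenticationException e)
            {
                // remember the failure; the user sees the error from the last configuration
                lastException = e;
            }
        }

        throw lastException != null ? lastException
                : new ProviderNotFoundException("No authentication provider accepted the request");
    }

    public void setSpringContextHolder(SpringContextHolder springContextHolder)
    {
        this.springContextHolder = springContextHolder;
    }
}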

Here’s a simple activity diagram showing how an authentication provider is added:

AddUserDirectory_ArmediaCaseManagement

And here’s a simple activity diagram showing how an authentication provider is removed:

RemoveUserDirectory_ArmediaCaseManagement

And finally, here’s a diagram showing the user authentication process:

UserLogin_ArmediaCaseManagement

Results

This solution definitely meets my goals.  It supports any Spring Security authentication provider, does not require new configuration file formats (it uses the well-known Spring configuration file), and requires no Web user interface.  It does not even require the initial administrator user!

The solution does have some problems, though.  Only the Spring configuration files are loaded, which means any classes required by those configurations must already be available.  And that means we have to ship the web application with every Spring Security module, since we can’t know beforehand which module any specific customer may need.

These problems actually point to their own solution.  In the future I want to extend this mechanism to be a full-fledged plugin system.  Many popular Web applications are based on plugins: Alfresco and JIRA (mentioned above), also Jenkins [13] and many others.  These plugin systems allow users to add custom classes, behavior, and user interface elements at runtime.  JIRA and Jenkins don’t even require restarts!  So I want Armedia Case Management to be the same.   I will write more blogs documenting my progress.

References

  1. Alfresco Help: Setting up Alfresco Authentication and Security
  2. Alfresco: Authentication Configuration Examples
  3. Alfresco: Configuring alfrescoNtlm
  4. Alfresco: Configuring pass thru
  5. Alfresco: Configuring LDAP
  6. Alfresco: Configuring Kerberos
  7. JIRA: Configuring User Directories
  8. JIRA: Configuring the Internal Directory
  9. JIRA: Connecting to an LDAP Directory
  10. JIRA: Connecting to Atlassian Crowd
  11. Spring IO: Custom AuthenticationManager
  12. Spring IO: Custom AuthenticationProvider
  13. Jenkins: Open Source Integration Server

Expanding Documentum's Full Text Search Capability with a Thesaurus

March 19th, 2014 by Scott Roth

An approach for enhancing customers’ satisfaction with Documentum’s built-in full text search capabilities is to provide them with a thesaurus of terms relevant to their industry, region, or business process.  For example, suppose a user needs to find all the invoices in their repository for the soda products they ordered last year.  In some parts of the country, ‘pop’ is an acceptable alternative to ‘soda’.  Therefore, your search must equate these two terms, as well as expand them to include the names of actual products.  A simple approach for implementing this capability is to build a thesaurus that contains an ontology of soda products.

In this example, let’s suppose a user searched on the words ‘soda’ and ‘invoice’, expecting to see results for Pepsi, Coke-Cola, Dr. Pepper, and Mt. Dew.  The search engine, as part of its preparation for executing the query, searches the thesaurus for ‘soda’ and automatically includes ‘Pepsi’, ‘Coke-Cola’, ‘Dr. Pepper’, and ‘Mt. Dew’ as search terms in the query.  Now the user gets the results they expected.

Documentum’s full text search engine is EMC’s xPlore and, among other cool things, it implements thesauri using the Simple Knowledge Organization System (SKOS) representation.  Once a thesaurus is created and installed in xPlore[1], it can be used to expand search terms with synonyms to perform broader searches.  SKOS can represent far more complex relationships than xPlore currently uses, but building the search engine on a representation like SKOS positions xPlore for much greater and more advanced types of searching in the future.

Building an xPlore SKOS thesaurus is as simple as writing an XML file containing SKOS elements.  There are really only three SKOS tags you need to know:

  • Concept – the concept is the idea you want to expand by including additional search terms in your query.  In this example, the concept is ‘soda’.
  • prefLabel – this is the preferred form of the term to be added to the query, i.e., the synonym for the concept.
  • altLabel – this is an alternate form of the term that can be added to the query.  altLabels often include abbreviations or alternate spellings of the prefLabel values.

With these SKOS elements, a simple xPlore thesaurus for this example might look like this:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<skos:Concept rdf:about="http://www.my.com/#soda">
<skos:prefLabel>Coke-Cola</skos:prefLabel>
<skos:prefLabel>Pepsi</skos:prefLabel>
<skos:prefLabel>Dr. Pepper</skos:prefLabel>
<skos:prefLabel>Mt. Dew</skos:prefLabel>
</skos:Concept>
<skos:Concept rdf:about="http://www.my.com/#pop">
<skos:prefLabel>Coke-Cola</skos:prefLabel>
<skos:prefLabel>Pepsi</skos:prefLabel>
<skos:prefLabel>Dr. Pepper</skos:prefLabel>
<skos:prefLabel>Mt. Dew</skos:prefLabel>
</skos:Concept>
</rdf:RDF>

 

Figure 1 – Sample xPlore Thesaurus for ‘soda’ and ‘pop’

As you can see in Figure 1, the thesaurus file contains two Concepts, ‘soda’ and ‘pop’.  The body of each Concept element contains the names of the soda products you want to include in your search whenever ‘soda’ or ‘pop’ is expanded.  These product names are represented as prefLabel elements in the XML.  In this case, the ‘soda’ and ‘pop’ Concepts contain the same set of expansion terms because ‘soda’ and ‘pop’ are synonyms and we want to ensure the same expansion is executed for both terms.

As an example of using the altLabel tag, consider that ‘coke’ is often used as an abbreviated form for Coke-Cola, a specific product, as well as for cola, a general class of soda products.  To ensure the term ‘coke’ is properly expanded to include ‘Coke-Cola’ as well as ‘cola’, the Concept depicted in Figure 2 can be added to the thesaurus.

<skos:Concept rdf:about="http://www.my.com/#coke">
<skos:prefLabel>Coke-Cola</skos:prefLabel>
<skos:altLabel>cola</skos:altLabel>
</skos:Concept>

 

Figure 2 – Concept Element for ‘coke’

Using a thesaurus can be a simple, yet powerful way to enhance users’ search experiences.  Because thesauri in xPlore are XML, they can be quickly and easily modified to meet the changing needs of users, industries, or business processes.  This means search experiences and results can be evaluated immediately without any reindexing of content or recompilation of search code.  Therefore, it is a good place to build industry-specific ontologies that can be easily adjusted and transferred to other systems which also understand SKOS.

References

 

 


[1] The xPlore v1.3 Administration and Development Guide, pages 213–217, discusses how to load your thesaurus and optionally configure debugging to watch search term expansion.

Slides From Today’s Webinar

March 4th, 2014 by Allison Cotney

In case you missed today’s webinar in which we walked through our Armedia Case Management solution framework with Alfresco Records Management, here are the slides from the presentation!!

 

Armedia Case Management is a web-based, workflow-driven case management software solution designed to help organizations capture, investigate, and manage cases.

Powered by Alfresco, Armedia Case Management provides a complete case management infrastructure to create and manage case information through its lifecycle. From initiation, assignment, investigation, evaluation, and disposition, Armedia Case Management tracks the case and the relationships to all of the collected information. Armedia Case Management provides pre-built law enforcement and inspector general workflows and data types, and can also support other organization types.

In this webinar, we walked through

  • How to declare a record in Alfresco Records Management within Armedia Case Management
  • How Armedia Case Management delivers easy compliance with the Presidential Memorandum on Managing Federal Government Records, as well as paperless initiatives

Stay tuned for a recording of the webinar!

Thinking in Mule: How I Learned to Love the Platform

February 27th, 2014 by David Miller

I just finished my first Mule flow. My flow receives an Alfresco node reference and declares a corresponding record in the Alfresco Records Management Application. I learned about Mule and Alfresco RM along the way… specifically, I learned to think like a Mule, instead of a Java developer.

Take the sub-flow to create the record folder. This folder may or may not already exist. In a concurrent environment, I prefer to try creating the folder first; if that fails, then look up the folder by path. Either way, the sub-flow has to end up with the folder node reference so I can later file the new record into the folder.

The try/catch approach. In a Java class, I might implement this by catching the exception from the create attempt. So in my Mule flow, I tried adding an error strategy. Many things went wrong with this. My custom error strategy had to apply only to the record folder lookup flow, so it had to be defined in that flow (otherwise it would handle any other unrelated errors that might occur during the overall records management flow). But Mule only allows custom error strategies in flows, not in sub-flows. Later I found out that calling a private flow changes flow variable visibility: this caused me many problems until I realized what was happening. Also, Mule didn’t always call my error strategy when record folder creation failed, and it still propagated the error to the outer flow, which tended to stop all processing. In short, this approach failed completely!

The first-successful approach. Mule’s first-successful message processor seemed like just the ticket. And it was a real Mule approach – not Java encoded in Mule! I added a first-successful element with two internal processors: the first one to create the folder, the second one to look up the folder ID. If the first one fails, Mule automatically tries the second one, and all will be well!

<first-successful>  
   <enricher doc:name="create-record-folder">
   <!-- .... -->
   </enricher>
   <enricher doc:name="get record folder id">
   <!-- .... -->
   </enricher>
</first-successful>

Well, this didn’t always work either, most likely because I didn’t know how to tell Mule how to distinguish success from failure.

The keep trying approach. Since the first-successful approach sometimes worked, I thought to myself, well, let’s just keep retrying it!

<until-successful objectStore-ref="until-successful-object-store" maxRetries="15" secondsBetweenRetries="1">  
   <first-successful>
      <enricher doc:name="create-record-folder">
      <!-- .... -->
      </enricher>
      <enricher doc:name="get record folder id">
      <!-- .... -->
      </enricher>
   </first-successful>
</until-successful>

The less said about this, the better. If it didn’t work the first time, I just saw 14 more failures in the log. Very depressing!

The check-the-result-code approach: it works! Checking the result code is older than Mule obviously; older than Java; older than structured exception handling! But it works very well in this situation.

<http:outbound-endpoint  
    exchange-pattern="request-response"
    method="POST"
    doc:name="create record folder"
    encoding="UTF-8"
    ref="alfresco-create-record-folder"
    contentType="application/json; charset=UTF-8"
    mimeType="application/json">
    <response>
        <json:json-to-object-transformer
            name="getJsonFromRecordFolder"
            returnClass="java.util.Map"/>
    </response>
</http:outbound-endpoint>  
<choice doc:name="Choice">  
    <when expression="message.inboundProperties['http.status'] == '200'">
        <set-variable 
            variableName="record-folder-id" 
            value="#[message.payload.persistedObject]"
            doc:name="Variable"/>
    </when>
    <otherwise>
        <processor-chain doc:name="Folder Lookup">
            <!-- lookup the folder here.... -->
        </processor-chain>
    </otherwise>
</choice>

If we get an HTTP 200 response to the first try, then we did create the folder, so we just record the new folder ID from the response JSON. If we did not get an HTTP 200, we assume the folder already existed (which is a good enough strategy for my purposes at the moment), so we look up the existing folder’s ID.

Writing this first Mule flow definitely has had the desired educational effect. I feel I have a good basic understanding of the Mule platform now. Also, when I run into roadblocks or undesired / inexplicable behavior, now I will trust the defect is in my understanding, and not in Mule itself. At least for such basic use cases as this one!

 

Introducing TDGfABI – Say What?

February 26th, 2014 by cstephenson

Part of a solution that is being implemented for a current client involves a custom retention framework.  Before you ask: no, Alfresco RM was not chosen.  Now that that question is out of the way, on to the real point: testing retention is painful.

The reason I say this is that retention involves a lot of categories and moving dates, so test data that was created last month may not be useful this month.  The test data that is created is crafted to be ingested by the Bulk Importer (BFSIT).

Manually creating test data for Bulk Importer is not that difficult, but it is a painfully long process when you need to create nested folders with different types of content at each folder.  After a long enough period of manually creating this data, I decided to implement a quick little utility to perform this task.

Hence, TDGfABI was born.  So what does TDGfABI stand for? Test Data Generator for Alfresco Bulk Importer.  Catchy, isn’t it?

As mentioned, having sample content in this case was not good enough.  Metadata files are also required so that retention states can be applied.  The next basic requirement was to create a nested folder hierarchy with documents and metadata at the bottom (leaf) folders.  So this got me thinking that this should all be configurable.  The user configuring this should be able to decide how many folders deep to go and how many folders wide.  This essentially allows for the creation of a well-balanced tree.  A future update will allow for creating an unbalanced folder tree.

It was also determined that, although generation of metadata was a requirement for this project, it should be made configurable; TDGfABI aims to be a generic test data generator.

At this point the utility has the ability to create nested folders, with documents and metadata at the leaf folders.  I should also add that the same number of documents was being created at each leaf folder, so this was generating a well-balanced tree.  This is great from Alfresco‘s point of view, but not really reflective of the real world, so generating an unbalanced number of documents at the leaf folders was added as a configuration option.  The number of documents generated at each leaf folder is also configurable, and when unbalanced mode is enabled, anywhere from one document up to the maximum number of documents can be created.  (A rough sketch of the balanced tree generation appears after the recap below.)

So, a quick recap:

1. Configurable depth and breadth of folders, creating a well-balanced tree

2. Generation of documents and metadata (optional) at the leaf folders

3. Balanced or unbalanced number of documents at the leaf folders
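
To make the recap above concrete, here is a rough sketch of what the balanced generation of items 1–3 might look like. It is purely illustrative – the class, method, and parameter names are hypothetical, not TDGfABI’s actual API – and it writes empty placeholder files rather than real content and metadata.

import java.io.File;
import java.io.IOException;

// Purely illustrative sketch of balanced folder/document generation;
// names are hypothetical, not TDGfABI's API.
public class BalancedTreeSketch
{
    // depth = how many folder levels to create, breadth = folders per level,
    // docsPerLeaf = documents created in each leaf folder
    public void generate(File parent, int depth, int breadth, int docsPerLeaf) throws IOException
    {
        if ( depth == 0 )
        {
            for ( int d = 0; d < docsPerLeaf; d++ )
            {
                // a content file plus a shadow metadata file alongside it (empty here)
                new File(parent, "doc-" + d + ".txt").createNewFile();
                new File(parent, "doc-" + d + ".txt.metadata.properties.xml").createNewFile();
            }
            return;
        }

        for ( int b = 0; b < breadth; b++ )
        {
            File child = new File(parent, "folder-" + b);
            child.mkdirs();
            // recurse until we reach the leaf level
            generate(child, depth - 1, breadth, docsPerLeaf);
        }
    }
}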

So then I got to thinking: we also need the ability to test versions.  Seeing as BFSIT only supports major versions at the minute – @pmonks, hurry up and release the next version, which supports minor versions! – should the number of major versions be balanced or unbalanced?  It should be configurable.  So, like the creation of documents, the number of versions can be configured, and you can decide whether these should be the same for every document or unbalanced.  Let’s not forget about metadata for versions.  This should also be configurable, to allow for those occasions when you do not need metadata.

But are major versions enough, seeing as @pmonks will be releasing an update soon for BFSIT?  OK, the next thing that was added was the ability to generate major and minor versions (optional, of course).  Metadata – let’s not forget metadata for minor versions, again optional of course.  Hmm, we should also allow for balanced or unbalanced minor versions as well, configurable of course.

So, quick recap no. 2:

1. Configurable depth and breadth of folders, creating a well-balanced tree

2. Generation of documents and metadata (optional) at the leaf folders

3. Balanced or unbalanced number of documents at the leaf folders

4. Generation of major versions and metadata (optional) for documents

5. Balanced or unbalanced number of major version documents

6. Generation of minor versions and metadata (optional) for documents

7. Balanced or unbalanced number of minor version documents

The last thing that I wanted to implement was creating documents (and versions, metadata, etc.) at any folder level.  Again, in the spirit of trying to make everything configurable, the user has the choice of creating documents at every folder depth or only at the leaf folders.  Folder metadata is optional as well, just in case I forgot to mention this.

So far, the most documents I have generated in one run was around 350,000.  Each of these documents also has versions and metadata for both the documents and the versions.  This utility is pretty flexible at creating various structures of test data, but where the real configuration power comes in is the types of document and object type metadata that can be generated.

In the next blog, I will discuss how these all work, and hopefully in the not too distant future we will put some lipstick on the code to pretty it up and open source it.

New Digital Accessibility Regulations For The Travel Industry

February 18th, 2014 by Doug Loo

November 2013 was an important month for the travel industry as the Final Rule for the Air Carrier Access Act (ACAA) was implemented and the formation of the Rail Vehicles Access Advisory Committee was approved by the Access Board.

So what does that mean for your external-facing IT portfolios?

Air Carriers:

On November 12, 2013, the US Department of Transportation (DOT) published in the Federal Register a final rule that amends its rules implementing the Air Carrier Access Act (ACAA) to require U.S. air carriers and foreign air carriers to make their Web sites that market air transportation to the general public in the United States accessible to individuals with disabilities. In addition, DOT is amending its rule that prohibits unfair and deceptive practices and unfair methods of competition to require ticket agents that are not small businesses to disclose and offer Web-based fares to passengers who indicate that they are unable to use the agents’ Web sites due to a disability.

The amendment applies to:

  • All domestic and foreign airlines operating at least one airplane with a seating capacity of more than 60 passengers, serving U.S. passengers.
  • Domestic and foreign airlines that have more than 10,000 passengers.
  • Ticket Agents that are not small businesses (including travel websites such as Kayak.com, cheaptickets.com, airlineconsolidator.com, cheapoair.com, and orbitz.com).

Complying with the Air Carrier Access Act Accessibility Amendment

The amendment lays out a two-phase compliance schedule for domestic and foreign airlines.

Compliance Schedule for Air Carrier Access Act

Phase 1: All “Core functions” must be WCAG 2.0 Level A and AA compliant by December 12, 2015. Core functions are defined as:

  • Booking or changing a reservation (including all flight amenities)
  • Checking-in for a flight
  • Accessing a personal travel itinerary
  • Accessing the status of a flight
  • Accessing a personal frequent flyer account
  • Accessing flight schedules
  • Accessing carrier contact information.

 

Online Disability Accommodation Requests

The rule also requires carriers to make an online service request form available to passengers with disabilities within two years of the rule’s effective date, so that they can request services including, but not limited to, wheelchair assistance, seating accommodation, escort assistance for a visually impaired passenger, and stowage for an assistive device.

 

Phase 2: All remaining pages must be made compliant by December 16, 2016.

Accessibility Requirements for Air Carrier Access Act

Please note: A carrier’s “text-only” version of a web page may only be considered an accessible alternative if it provides the same content and functionality as the corresponding non-text version and can be reached via an accessible link from the primary site. The “text-only” version must conform to WCAG 2.0 Level A and AA guidelines and must be promptly and regularly updated.

In addition, domestic and foreign airlines exceeding 10,000 passengers must ensure accessibility of all kiosks installed after December 12, 2016, and 25% of kiosks in each location must meet the specified accessibility standards by December 12, 2021.

Finally, Ticket Agents that are not small businesses must accessibly disclose and offer web-based fares on or after June 14, 2014.

 

What is the risk for non-compliance?

 

The ACAA is known for imposing some extremely large fines, as witnessed by a $50,000 fine assessed to Frontier Airlines for compliance failure.   If you take into account that domestic carriers alone received almost 19,000 complaints, as reported in the 2012 annual report on complaints received by airlines in 2011, you have a recipe for a fairly large risk to the business.  I would imagine that, now that electronic request forms must be put in place on every travel website and the amendments extend to IT, the number of complaints will increase.

The Department’s Enforcement Office stated that it intends to audit carriers as it deems necessary in the future to ensure accurate reporting. In 2009, 2010 and 2011, the Enforcement Office conducted a number of on-site investigations, which involved reviewing carrier records to, among other things, verify the accuracy of the carrier’s disability reporting.

In May 2010, one carrier was fined $100,000 for undercounting disability-related complaints, and in February 2011, one carrier was fined $2.0 million for violating numerous provisions of the ACAA regulation, including undercounting disability-related complaints.  Four other carriers have been assessed civil penalties in cases that in part involved similar kinds of violations.   The Department’s Enforcement Office also investigates each disability-related complaint filed directly with DOT’s Aviation Consumer Protection Division.

The Sunny Side of Tika

February 13th, 2014 by Lee Grayson

One of the great joys of development comes when you learn about a program that makes your development task very simple.  Apache Tika is one of those programs, and before I even begin to talk about Tika, I have to tip my hat to the developers.  Thank you very much for making my job simpler.

So, what is Tika?

Tika reads the content and metadata of almost any file so your programs can consume it.  Tika is commonly used by those working with eDiscovery, taxonomy generation, content capture, and indexing for content management systems. Basically, anytime you want your application to understand all there is to know about a file or URI, look to Tika to convert it for you.

Tika was initially designed as part of the Apache Lucene project, which is used to automatically index files for full text searches.  Tika has been around for several years now, but remains a very active project due to the task it was designed for and how well it was written.

The great thing about Tika is you don’t have to know what the file type is for Tika to parse it.  Tika determines the file type for you based on the file’s header or its extension. It then uses its built-in parsers to read the file.
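
For example, asking Tika for a file’s type is a single call (a minimal sketch; the file path is just a placeholder):

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class DetectExample {
	public static void main(String[] args) throws IOException {
		Tika tika = new Tika();
		// Tika inspects the file header (falling back to the extension when needed)
		// and returns a MIME type string such as "application/pdf"
		String mimeType = tika.detect(new File("/tmp/mystery-file"));
		System.out.println(mimeType);
	}
}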

How easy is Tika?

Below is the section of my code that uses Tika version 1.4.

	public String parseToString(File givenFile) throws IOException, TikaException{
		Tika myTika = new Tika();
		return myTika.parseToString(givenFile);
	}

Yep, that is parsed content in two simple lines of code.  Now, mind you, I was only interested in getting the content of the file in a text format, but that simplicity is what I am grateful for.  Tika, in all of its facets, is simple to use while remaining very versatile. (Note: when testing this code, make certain the file actually has readable content.  A TIFF or an MP3 file generally does not contain text content to be parsed, so you won’t see anything.)

Other ways of using Tika

Okay, so maybe you want a little bit more information from a file, like fetching the extra metadata found in a file header.  To complete this task some extra steps are needed, but they are well documented. Below is an example of fetching metadata from a file.  In the case below, I’m generating a JSONObject of name-value pairs for the metadata so the application can consume the metadata later.

public JSONObject fetchMetaData(File givenFile) throws IOException {

		JSONObject jsonMetaData = new JSONObject();

		Metadata metadata = new Metadata();
		Tika myTika = new Tika();

		InputStream instream = new FileInputStream(givenFile);

		myTika.parse(instream, metadata);

		 for (String metaKey : metadata.names()) {
			 String metaValue = metadata.get(metaKey);

			 jsonMetaData.put(metaKey, metaValue);
		}

		return jsonMetaData;
	}

Now maybe you want to fetch both metadata and content; the code below uses the Tika classes to fetch both at one time, so you only make one pass over the File or InputStream:

public String getEverything(File givenFile) throws IOException, SAXException, TikaException {

		JSONObject jsonMetaData = new JSONObject();
		ContentHandler handler = new BodyContentHandler();

		Metadata metadata = new Metadata();
		Parser parser = new AutoDetectParser();
		InputStream instream = new FileInputStream(givenFile);
		ParseContext parsedContent = new ParseContext();

		parser.parse(instream, handler, metadata, parsedContent);

		 for (String metaKey : metadata.names()) {
			 String metaValue = metadata.get(metaKey);

			 jsonMetaData.put(metaKey, metaValue);
		}

		// Return string
		return jsonMetaData.toJSONString() + "\n\n" + handler.toString();
	}

 

More Tika Documentation and Examples

http://tika.apache.org/index.html

http://tika.apache.org/1.4/gettingstarted.html

http://www.openlogic.com/wazi/bid/314389/Content-mining-with-Apache-Tika

http://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.4

 

 

Copyright © 2002–2011, Armedia. All Rights Reserved.