Enterprise Content Management

Armedia Blog

Archive for the ‘ECM Industry’ Category

Caliente Benchmark

March 26th, 2013 by Scott Roth

Armedia’s Caliente is a high-performance content migration tool designed to quickly and easily import content and metadata into a variety of leading content management systems (CMS). Caliente has been used in numerous projects delivered by Armedia and is currently in use as a key component of many more. Some of our clients run Caliente continually to import content from external “feeds” into managed content repositories. These customers process millions of content files per day using multiple instances of Caliente. Other implementations of Caliente were designed for one-time migrations of content from one CMS to another. These instances also have moved millions of files.

Inevitably, a prospective client’s first question is, “How fast can it load my [insert number here] files?” Our pat answer is, “How fast do you need them loaded?” I think that’s a fair answer to an inherently unfair question. Obviously the rate at which Caliente can import content files relies on several variables outside the control of Caliente, or Armedia, or sometimes even the client.

For example:

  • the speed of a client’s network infrastructure can add latency to the process;
  • the size of the content files being imported can also add to latency;
  • the speed of the hosting hardware running Caliente and the resources available to it can affect performance;
  • the capacity of the CMS to ingest the content can affect performance;
  • the complexity of any transformations required to the metadata or content before loading can add latency to the process;
  • dependence on external resources (e.g., looking up additional metadata from an external system) can add latency to the process;
  • drive and/or database contention among processes running on the hosting machine can degrade performance.

These and other factors all affect how quickly Caliente can load a customer’s files. The good news – and the justification for the answer given – is that there are numerous configurations and techniques Armedia can employ to ensure Caliente can load a customer’s files in the time frame required.

All that said, I think what most customers are looking for when they ask that question is a benchmark: honest-to-goodness statistics, not anecdotal explanations about performance. So, what follows is a quick benchmark I performed to provide hard evidence of Caliente’s performance characteristics.

The entire benchmark was run on a virtual machine hosted on my laptop. The virtual server was running Windows 2003, with 2 CPUs (2.2 GHz) and 4 GB of RAM. The virtual server was running Caliente, SQL Server 2005, and Documentum Content Server 6.7. For test content I used a variety of files downloaded from textfiles.com, Project Gutenberg, and binaries from the image itself.

Here are some statistics about the test corpus:

  • corpus file count: 25,890 files;
  • corpus size: 3.31 GB;
  • average file size: 133 KB;
  • minimum file size: 168 bytes;
  • maximum file size: 36.9 MB.

I ran Caliente in “hot folder” mode, which simply means it waited for me to drop files into a watched folder, and then it processed them. Each content file was accompanied by a metadata file that contained a set of five attributes to be set on the Documentum object (dm_document) when it was imported.

Here are the benchmark results:

  • total files processed: 25,890;
  • total files imported: 25,890;
  • total processing time: 00:52:31 (hr:min:sec);
  • rates:
    • 0.122 sec / file;
    • 8.22 files / sec;
    • 0.935 sec / MB;
    • 1.07 MB / sec.
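For readers who want to check the arithmetic, the rates above follow directly from the file count, corpus size, and elapsed time. The short Python sketch below recomputes them. The variable names are mine, and it assumes 1 GB = 1,024 MB; because the corpus size is rounded to 3.31 GB, the per-MB figures may differ slightly from the list in the last digit.

```python
# Recompute the benchmark rates from the figures reported in the post.
files_imported = 25_890
corpus_mb = 3.31 * 1024        # assumption: 1 GB = 1,024 MB
elapsed_sec = 52 * 60 + 31     # 00:52:31 expressed in seconds

sec_per_file = elapsed_sec / files_imported
files_per_sec = files_imported / elapsed_sec
sec_per_mb = elapsed_sec / corpus_mb
mb_per_sec = corpus_mb / elapsed_sec

print(f"{sec_per_file:.3f} sec / file")
print(f"{files_per_sec:.2f} files / sec")
print(f"{sec_per_mb:.3f} sec / MB")
print(f"{mb_per_sec:.2f} MB / sec")
```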

A few notes about this benchmark:

  • Monitoring Caliente’s performance meant turning on detailed logging and running other hardware monitoring processes (e.g., Microsoft’s Performance MMC and Task Manager). Monitoring like this inherently introduces load and latency that would not otherwise exist, and so affects the results of the benchmark. It is a sort of computing analog to the observer effect in physics. The benchmark metrics listed above could therefore be improved by turning off all of this monitoring and debugging. Not surprisingly, my monitoring of the import process revealed that disk I/O was the greatest bottleneck in my environment.
  • If you were to run this benchmark on a different server or in a different environment, you would likely receive different results — even if you used the same test corpus and configuration of Caliente. That’s just the nature of benchmarks; they are only valid in very controlled situations. However, they are a good indicator of performance as long as you understand the conditions of the benchmark.

The point I want to make is this: Caliente is capable of processing and importing an impressive volume of content, and can be tuned and configured to meet your performance requirements in whatever environment you run it. With Armedia’s vast experience with content-related migration and migration tools like Caliente, we can assure you that we can meet your import/migration performance requirements, whatever they may be.


Alfresco Content.gov 2013 Presentation

March 14th, 2013 by Allison Cotney

Did you miss our recent presentation at Alfresco Content.gov? No worries! We’ve posted the slides here as well as additional information about the use of Alfresco within the Federal Government.

To learn more about our federal services, visit our microsite!

Alfresco DevCon 2012 Presentation – A FOIA Solution Pattern

November 28th, 2012 by Allison Cotney

In case you missed us at Alfresco DevCon San Jose, here is a copy of our presentation.

This presentation focused on illustrating how Alfresco and other open source technologies can be leveraged for government agencies.


Query Results Truncated in Documentum

November 26th, 2012 by Scott Roth

Recently, some colleagues and I were discussing whether the Content Server truncated result sets for large queries. They insisted that it did and that the largest result set Documentum would return was 1000 rows or 350 rows from any single source (the default values for dfc.search.max_results and dfc.search.max_results_per_source in the dfc.properties file). “Ridiculous!” I exclaimed. I had run queries that returned thousands of rows and could prove it. So, I set out on this little research project.

To prove my point, I decided to run a query that returned a known result set from a variety of clients, while changing the settings of dfc.search.max_results and dfc.search.max_results_per_source. To set these properties, I added the following lines to the dfc.properties file on both the Content Server and the DA web application server. I set these properties artificially low to make the results obvious.

dfc.search.max_results = 100
dfc.search.max_results_per_source = 10

The query I ran was select r_object_id from dm_folder. In my repository, this query returned 743 rows (from iDQL, which I used as my baseline). I also ran this query from the RepoInt utility, the DA DQL Editor and the DA Advanced Search page. If there was any truth to the claims of my colleagues, I should see a result set no larger than 100 rows when the properties were in effect. See the table below for the results.

Client            No Config Changes   Content Server Only   App Server Only
iDQL32            743                 743                   743
RepoInt           743                 743                   743
DA DQL Editor     743                 743                   743
DA Adv. Search    350                 350                   10

Interestingly, the Advanced Search did truncate the result set, but not as I expected. It truncated the result set to 350 when these properties were not explicitly set, leading me to believe there was some sort of default in play. It also truncated the result set to 10, not 100, when the properties were set. What’s going on here?

After reading up a bit on dfc.search.max_results and dfc.search.max_results_per_source properties, I concluded that these configuration settings only affect ECIS/FS2 searches and not “regular” client searches (i.e., iDQL, RepoInt, DQL Editor, etc.). However, since Webtop (and DA) are configured to use ECIS/FS2 when they are installed, it appears that the Advanced Search does respect the dfc.search.max_results and dfc.search.max_results_per_source properties when they are set. Here’s how it works:

The dfc.search.max_results property dictates how large the final result set can be. The default value is 1,000. In my testing, this was supposed to be 100 rows. However, this setting is the maximum setting for the entire result set and is further constrained by the dfc.search.max_results_per_source property.

The dfc.search.max_results_per_source property dictates the maximum number of results that can be returned from a single source. The default value is 350. Since my testing only involved one repository, the maximum number of results returned was 10. If I had searched across 2 repositories, the final result set would have contained 20 rows (max). Following this logic, if I had searched across 20 repositories, the result would have been 100 (the maximum size allowed by the dfc.search.max_results property), not 200 as expected.
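The interaction between the two properties can be sketched in a few lines of Python. This is only a model of the behavior I observed, not DFC code: the function name and its arguments are hypothetical, and the defaults are the 1,000 and 350 values mentioned above.

```python
def advanced_search_row_count(rows_per_source,
                              max_results=1000,
                              max_results_per_source=350):
    """Model the observed capping: each source is limited first, then the total."""
    # Cap what each repository (source) may contribute.
    capped = [min(rows, max_results_per_source) for rows in rows_per_source]
    # Cap the merged result set as a whole.
    return min(sum(capped), max_results)


# One repository with 743 matching folders, default settings.
print(advanced_search_row_count([743]))
# Same repository with the artificially low test settings.
print(advanced_search_row_count([743], max_results=100, max_results_per_source=10))
# Twenty such repositories with the test settings.
print(advanced_search_row_count([743] * 20, max_results=100, max_results_per_source=10))
```

With the numbers from my test, the three calls give 350, 10, and 100 rows, which matches the table and the reasoning above.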

My advice is this: if you are only searching one repository, set the dfc.search.max_results and dfc.search.max_results_per_source properties equal to each other to ensure your Advanced Searches return maximum result sets. Which values produce the best performance and efficiency is up to you to determine.

So, my colleagues and I were both right; we just needed to specify how we were running our queries.

This blog was originally posted at msroth.wordpress.com on July 5, 2010.


DrupalCamp Atlanta Presentation – Drupal and PhoneGap: The Mobile App

November 9th, 2012 by Allison Cotney

Did you miss our presentation at DrupalCamp Atlanta? No worries, here is the presentation detailing the integration of Drupal with PhoneGap and how the Armedia team used that integration to deliver a mobile app for The Well Project.

 


Finding an Object’s Content File in Documentum

October 24th, 2012 by Scott Roth

You probably know that Documentum (in its default state) stores content on the file system and retains a pointer to the content in its database. Likely, you have navigated the file store on the Content Server and discovered directories like ../data/docbase/content_storage_01/00000123/80/00/23/. How in the world does this directory structure relate back to a particular object?

Documentum uses several objects to hold persistence information about content; we will use five of them to determine where the content for an object with r_object_id = '0900000180023d07' resides: dmr_content, dm_format, dm_filestore, dm_docbase_config, and dm_location. The following query will get us all the information we need to assemble the path to the object’s content.

select data_ticket, dos_extension, file_system_path, r_docbase_id from dmr_content c, dm_format f, dm_filestore fs, dm_location l, dm_docbase_config dc where any c.parent_id = '0900000180023d07' and f.r_object_id = c.format and fs.r_object_id = c.storage_id and l.object_name = fs.root

Result:

  • data_ticket = -2147474649
  • dos_extension = txt
  • file_system_path = C:/Documentum/data/docbase/content_storage_01
  • r_docbase_id = 123

The trick to determining the path to the content is in decoding the data_ticket, which is stored as a signed (two’s complement) decimal value. Convert the data_ticket to its unsigned 32-bit hexadecimal form by first adding 2^32 to the number and then converting the result to hex. You can use a scientific calculator to do this or grab some Java code off the net.

  • -2147474649 + 2^32 = (-2147474649 + 4294967296) = 2147492647
  • converting 2147492647 to hex = 80002327

Now, split the hex value of the data_ticket at every two characters, append the pieces to the file_system_path and the r_docbase_id (padded to 8 digits), and add the dos_extension. Voilà! You have the complete path to the content file.

C:/Documentum/data/docbase/content_storage_01/00000123/80/00/23/27.txt
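If you would rather not reach for a calculator, the same steps can be sketched in Python. The function name and parameters below are my own for illustration; the inputs are the four values returned by the query, and the docbase ID is simply zero-padded to 8 digits as in the example above.

```python
def content_file_path(data_ticket, file_system_path, r_docbase_id, dos_extension):
    """Assemble a content file path from the query results (illustrative sketch)."""
    # Masking to 32 bits is equivalent to adding 2^32 to a negative ticket.
    unsigned_ticket = data_ticket & 0xFFFFFFFF
    hex_ticket = format(unsigned_ticket, "08x")          # e.g. '80002327'
    # Split the hex string at every two characters: 80 / 00 / 23 / 27.
    pieces = [hex_ticket[i:i + 2] for i in range(0, 8, 2)]
    # Zero-pad the docbase ID to 8 digits, following the post's example.
    docbase_dir = str(r_docbase_id).zfill(8)
    return "/".join([file_system_path, docbase_dir, *pieces]) + "." + dos_extension


print(content_file_path(-2147474649,
                        "C:/Documentum/data/docbase/content_storage_01",
                        123,
                        "txt"))
```

Run against the query results shown earlier, this prints the same path as the one above.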

I think this is a really clever way to manage the creation and assignment of directories and filenames, don’t you? In addition, this scheme guarantees that there are never more than 256 files in a single directory, which keeps each directory small and quick to search.

You can do it in reverse, too. Say you have a file with this path: /80/00/20/23.txt. What is its r_object_id?

  • converting 80002023 to decimal = 2147491875
  • subtract 2^32: 2147491875 – 4294967296 = -2147475421
  • select r_object_id, object_name from dm_sysobject s, dmr_content c where any c.parent_id = s.r_object_id and c.data_ticket = -2147475421.0

Note: You must append “.0” to the data_ticket value to force DQL to treat the value as a floating point number; otherwise you get an integer overflow error.
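The reverse calculation can be sketched the same way. Again, the function names are hypothetical; the code takes the last four path segments (80/00/20/23 for the hex value 80002023 above), converts them back to a signed data_ticket, and builds the DQL string with the “.0” suffix. It only builds the query text and does not talk to a repository.

```python
def data_ticket_from_path(path):
    """Recover the signed data_ticket from a content file path (illustrative sketch)."""
    stem = path.rsplit(".", 1)[0]                     # drop the file extension
    segments = [s for s in stem.split("/") if s][-4:]  # last four hex pairs
    unsigned_ticket = int("".join(segments), 16)
    # Values at or above 2^31 represent negative numbers in two's complement.
    if unsigned_ticket >= 2**31:
        return unsigned_ticket - 2**32
    return unsigned_ticket


def lookup_query(data_ticket):
    """Build the DQL lookup, appending '.0' to avoid the integer overflow error."""
    return ("select r_object_id, object_name from dm_sysobject s, dmr_content c "
            "where any c.parent_id = s.r_object_id "
            f"and c.data_ticket = {data_ticket}.0")


ticket = data_ticket_from_path("/80/00/20/23.txt")
print(ticket)
print(lookup_query(ticket))
```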

Of course, you can always use the GET_FILE administrative method to find the path to an object’s content file. Just remember that the content ID it asks for is the r_object_id of the dmr_content object.

This blog was originally posted at msroth.wordpress.com on Sept 9, 2011.


Birds of a Feather? Alfresco and Ephesoft

October 19th, 2012 by Allison Cotney

In our last blog we discussed the value and inherent benefits of integrating Alfresco, the open source enterprise content management platform, and Ephesoft, the open source intelligent document capture solution, in an enterprise solution. In this blog, let’s look at the business models of both of these companies and get a better understanding of why these two technologies work so well with one another.

Benefit 1: COST

Without a doubt, the value that both of these companies place on open source technology development is a huge factor in their success. Still, let’s take a quick look at the cost perspective of these two systems compared to other proprietary solutions in the industry.

Cost of implementation is a major differentiator between an Alfresco-Ephesoft integrated solution and other, proprietary solutions in the industry. Ephesoft itself reduces document capture and mailroom automation costs by up to 80% over comparable proprietary capture solutions. Combine that number with the savings attributed to the Alfresco platform, and the numbers, as you see above, just pile on.

This cost difference is attributed to the fact that neither Alfresco nor Ephesoft requires its users to pay an up-front license fee. Instead, the companies charge an annual software support fee to their enterprise clients. This lower cost model allows clients to achieve a faster return on their technology investments while ensuring that the organization is not forgoing any of the strengths that are expected of an enterprise solution.

Benefit 2: CLOUD READY

Both Alfresco and Ephesoft are poised for cloud-based solutions. Ephesoft is the industry’s only Java-based, 100% browser-based Advanced Capture System. The fact that Ephesoft is browser-based enables employees to ingest documents from anywhere, creating a capture solution that is mobile ready out-of-the-box. Further, since all users need to access Ephesoft is a standard web browser, this is accomplished without downloading any additional software.

Benefit 3: ADHERENCE TO INDUSTRY OPEN SOURCE STANDARDS

The open source nature of both of these platforms allows them to embrace industry standards such as Content Management Interoperability Services (CMIS) that provide guidelines for ECM solution integration methodologies. By doing this, both Alfresco and Ephesoft are poised to operate effectively with existing technology portfolios. This provides several benefits, including:

  • rapid deployment of solutions,
  • minimization of stresses and headaches attributed to system migration, and
  • faster achievement of return on technology investments.

The nature of open source also allows for the solution to be both extremely scalable and customizable.

Benefit 4: FLEXIBILITY

Another inherent benefit of implementing an Alfresco-Ephesoft integrated solution to control the full lifecycle of your documents is the range of possibilities that comes with the open source nature of both of these products. Regardless of the type of business processes that are in play, both Alfresco and Ephesoft can be configured to provide custom workflows and meet specific content management requirements, from Document Management or Records Management to Collaboration, or even more specific workflow requirements like Case Management.

As you can see, there are several benefits of this solution that have roots in the foundations of both of these open source technologies. One thing is for sure: the possibilities of what can be accomplished with these systems are endless! For more information, view our Alfresco and Ephesoft Blog Series!


Tips for Starting an Enterprise Document Capture Project

October 3rd, 2012 by Lee Grayson

In today’s world, it is critical that businesses and government organizations are able to capture and fully utilize all of the information they have at their disposal. This includes information found in both paper and electronic documents. This is where intelligent document capture solutions can come into play. By integrating an Intelligent Document Capture solution with your existing document management system, you gain the ability to digitize paper documents and utilize them in electronic format within your existing workflows.

However, before an Intelligent Document Capture solution can be successfully implemented, a few steps have to be taken to ensure that the solution is implemented correctly according to each business’s specific requirements and current processes.

Tip #1 – Know Your Business 

(more…)


Part IV – Opening a File in a Documentum Repository through the Adobe FrameMaker Integration

September 25th, 2012 by Jonathan Byars

Now that the Adobe FrameMaker application has been successfully connected to the Documentum Content Server Repository, it is time to open a project file. Opening a project file can be accomplished through several methods: browsing the repository through the tree view pane, clicking “File” and then “Open” in the menu, or clicking the “Open…” link on the Adobe FrameMaker welcome screen. The latter two methods produce the same results.

Opening a File through FrameMaker Documentum Integration

(more…)


Part III – Connecting to a Documentum Repository

September 25th, 2012 by Jonathan Byars

Once you have successfully completed the steps from Part II, my previous blog, to create a connection to the Documentum repository, and the application has restarted, you will need to establish the connection to the Documentum Content Server Repository that was configured. To do this part of the process, we will turn to the Connection Manager.

Click the “CMS” menu, then select “Connection Manager”.

Select the Connection Manager

(more…)

Copyright © 2002–2011, Armedia. All Rights Reserved.