Enterprise Content Management

Armedia Blog

Archive for the ‘Content Management’ Category

Caliente Benchmark

March 26th, 2013 by Scott Roth

Armedia’s Caliente is a high-performance content migration tool designed to quickly and easily import content and metadata into a variety of leading content management systems (CMS). Caliente has been used in numerous projects delivered by Armedia and is currently in use as a key component of many more. Some of our clients run Caliente continually to import content from external “feeds” into managed content repositories. These customers process millions of content files per day using multiple instances of Caliente. Other implementations of Caliente were designed for one-time migrations of content from one CMS to another. These instances have also moved millions of files.

Inevitably, a prospective client’s first question is, “How fast can it load my [insert number here] files?” Our pat answer is, “How fast do you need them loaded?” I think that’s a fair answer to an inherently unfair question. Obviously the rate at which Caliente can import content files relies on several variables outside the control of Caliente, or Armedia, or sometimes even the client.

For example:

  • the speed of a client’s network infrastructure can add latency to the process;
  • the size of the content files being imported can also add to latency;
  • the speed of the hosting hardware running Caliente and the resources available to it can affect performance;
  • the capacity of the CMS to ingest the content can affect performance;
  • the complexity of any transformations required to the metadata or content before loading can add latency to the process;
  • dependence on external resources (e.g., looking up additional metadata from an external system) can add latency to the process;
  • drive and/or database contention among processes running on the hosting machine can slow processing.

These and other factors all affect how quickly Caliente can load a customer’s files. The good news – and the justification for the answer given – is that there are numerous configurations and techniques Armedia can employ to ensure Caliente can load a customer’s files in the time frame required.

All that said, I think what most customers are looking for when they ask that question is a benchmark: honest-to-goodness statistics, not anecdotal explanations about performance. So, what follows is a quick benchmark I performed to provide hard evidence of Caliente’s performance characteristics.

The entire benchmark was run on a virtual machine hosted on my laptop. The virtual server was running Windows 2003, with 2 CPUs (2.2 GHz) and 4 GB of RAM. The virtual server was running Caliente, SQL Server 2005, and Documentum Content Server 6.7. For test content I used a variety of files downloaded from textfiles.com and Project Gutenberg, plus binaries from the virtual machine image itself.

Here are some statistics about the test corpus:

  • corpus file count: 25,890 files;
  • corpus size: 3.31 GB;
  • average file size: 133 KB;
  • minimum file size: 168 bytes;
  • maximum file size: 36.9 MB.

I ran Caliente in “hot folder” mode, which simply means it waited for me to drop files into a watched folder, and then it processed them. Each content file was accompanied by a metadata file that contained a set of five attributes to be set on the Documentum object (dm_document) when it was imported.

Here are the benchmark results:

  • total files processed: 25,890;
  • total files imported: 25,890;
  • total processing time: 00:52:31 (hr:min:sec);
  • rates:
    • 0.122 sec / file;
    • 8.22 files / sec;
    • 0.935 sec / MB;
    • 1.07 MB / sec.
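
As a quick sanity check, the rates above can be re-derived from the file count, corpus size, and elapsed time. This is a minimal Python sketch, not part of Caliente; the function name is made up for illustration, and it assumes 1 GB = 1024 MB, so the per-MB figures land close to, but not exactly on, the rounded values reported above.

```python
# Figures taken from the benchmark results above.
FILES = 25_890
CORPUS_MB = 3.31 * 1024        # 3.31 GB, assuming 1 GB = 1024 MB
ELAPSED_S = 52 * 60 + 31       # 00:52:31 expressed in seconds


def rates(files, corpus_mb, elapsed_s):
    """Return the four throughput figures used in the benchmark (hypothetical helper)."""
    return {
        "sec_per_file": elapsed_s / files,
        "files_per_sec": files / elapsed_s,
        "sec_per_mb": elapsed_s / corpus_mb,
        "mb_per_sec": corpus_mb / elapsed_s,
    }


for name, value in rates(FILES, CORPUS_MB, ELAPSED_S).items():
    print(f"{name}: {value:.3f}")
```

Swapping in your own file count, corpus size, and elapsed time gives figures you can compare directly against this benchmark.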

A few notes about this benchmark:

  • Monitoring Caliente’s performance meant turning on detailed logging and running other hardware monitoring processes (e.g., Microsoft’s Performance MMC and Task Manager). Monitoring performance like this inherently introduces load and latency that otherwise would not exist, thus affecting the results of the benchmark. It is sort of a computer analog to physics’ observer effect. Therefore, the benchmark metrics listed above could be improved by turning off all of this monitoring and debugging. Not surprisingly, my monitoring of the import process revealed that disk I/O was the greatest bottleneck in my environment.
  • If you were to run this benchmark on a different server or in a different environment, you would likely receive different results — even if you used the same test corpus and configuration of Caliente. That’s just the nature of benchmarks; they are only valid in very controlled situations. However, they are a good indicator of performance as long as you understand the conditions of the benchmark.

The point I want to make is this: Caliente is capable of processing and importing an impressive volume of content, and can be tuned and configured to meet your performance requirements in whatever environment you run it. With Armedia’s vast experience with content-related migration and migration tools like Caliente, we can assure you that we can meet your import/migration performance requirements, whatever they may be.

Query Results Truncated in Documentum

November 26th, 2012 by Scott Roth

Recently, some colleagues and I were discussing whether the Content Server truncated result sets for large queries. They insisted that it did and that the largest result set Documentum would return was 1000 rows or 350 rows from any single source (the default values for dfc.search.max_results and dfc.search.max_results_per_source in the dfc.properties file). “Ridiculous!”, I exclaimed. I had run queries that returned thousands of rows and could prove it. So, I set out on this little research project.

To prove my point, I decided to run a query that returned a known result set from a variety of clients, while changing the settings of dfc.search.max_results and dfc.search.max_results_per_source. To set these properties, I added the following lines to the dfc.properties file on both the Content Server and the DA web application server. I set these properties artificially low to make the results obvious.

dfc.search.max_results = 100
dfc.search.max_results_per_source = 10

The query I ran was select r_object_id from dm_folder. In my repository, this query returned 743 rows (from iDQL, which I used as my baseline). I also ran this query from the RepoInt utility, the DA DQL Editor and the DA Advanced Search page. If there was any truth to the claims of my colleagues, I should see a result set no larger than 100 rows when the properties were in effect. See the table below for the results.

Client           No Config Changes   Content Server Only   App Server Only
iDQL32           743                 743                   743
RepoInt          743                 743                   743
DA DQL Editor    743                 743                   743
DA Adv. Search   350                 350                   10

Interestingly, the Advanced Search did truncate the result set, but not as I expected. It truncated the result set to 350 when these properties were not explicitly set, leading me to believe there was some sort of default in play. It also truncated the result set to 10, not 100, when the properties were set. What’s going on here?

After reading up a bit on dfc.search.max_results and dfc.search.max_results_per_source properties, I concluded that these configuration settings only affect ECIS/FS2 searches and not “regular” client searches (i.e., iDQL, RepoInt, DQL Editor, etc.). However, since Webtop (and DA) are configured to use ECIS/FS2 when they are installed, it appears that the Advanced Search does respect the dfc.search.max_results and dfc.search.max_results_per_source properties when they are set. Here’s how it works:

The dfc.search.max_results property dictates how large the final result set can be. The default value is 1,000. In my testing, this was supposed to be 100 rows. However, this setting is the maximum setting for the entire result set and is further constrained by the dfc.search.max_results_per_source property.

The dfc.search.max_results_per_source property dictates the maximum number of results that can be returned from a single source. The default value is 350. Since my testing only involved one repository, the maximum number of results returned was 10. If I had searched across 2 repositories, the final result set would have contained 20 rows (max). Following this logic, if I had searched across 20 repositories, the result would have been 100 (the maximum size allowed by the dfc.search.max_results property), not 200 as expected.
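
The way the two properties interact can be summarized as a simple minimum. The sketch below is only a model of the behavior I observed and described above, not Documentum code; the function name is hypothetical, and it assumes every source has at least max_results_per_source matching rows.

```python
def advanced_search_cap(max_results, max_results_per_source, source_count):
    """Upper bound on rows an Advanced Search can return (model of observed behavior)."""
    # Each source contributes at most max_results_per_source rows,
    # and the combined result set is then capped at max_results.
    return min(max_results, max_results_per_source * source_count)


# Settings used in the test: max_results = 100, max_results_per_source = 10
print(advanced_search_cap(100, 10, 1))    # one repository
print(advanced_search_cap(100, 10, 2))    # two repositories
print(advanced_search_cap(100, 10, 20))   # twenty repositories

# Defaults (1000 and 350) against a single repository
print(advanced_search_cap(1000, 350, 1))
```

With a single repository the per-source limit is always the binding one unless the two properties are equal, which is why the Advanced Search returned 350 rows by default and 10 rows with my test settings.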

My advice: if you are only searching one repository, set the dfc.search.max_results and dfc.search.max_results_per_source properties equal to each other to ensure your Advanced Searches return the maximum result set. The actual value that gives you the best performance and efficiency is up to you to determine.

So, my colleagues and I were both right; we just needed to specify how we were running our queries.

This blog was originally posted at msroth.wordpress.com on July 5, 2010.

Finding an Object’s Content File in Documentum

October 24th, 2012 by Scott Roth

You probably know that Documentum (in its default state) stores content on the file system and retains a pointer to the content in its database. Likely, you have navigated the file store on the Content Server and discovered directories like ../data/docbase/content_storage_01/00000123/80/00/23/. How in the world does this directory structure relate back to a particular object?

Documentum uses several objects to hold persistence information about content; we will use five of them to determine where the content for an object with r_object_id = '0900000180023d07' resides: dmr_content, dm_format, dm_filestore, dm_docbase_config, and dm_location. The following query will get us all the information we need to assemble the path to the object’s content.

select data_ticket, dos_extension, file_system_path, r_docbase_id from dmr_content c, dm_format f, dm_filestore fs, dm_location l, dm_docbase_config dc where any c.parent_id = '0900000180023d07' and f.r_object_id = c.format and fs.r_object_id = c.storage_id and l.object_name = fs.root

Result:

  • data_ticket = -2147474649
  • dos_extension = txt
  • file_system_path = C:/Documentum/data/docbase/content_storage_01
  • r_docbase_id = 123

The trick to determining the path to the content is in decoding the data_ticket, which is stored as a signed (two’s complement) decimal value. Convert the data_ticket to its two’s complement hexadecimal form by first adding 2^32 to the number and then converting the result to hex. You can use a scientific calculator to do this or grab some Java code off the net.

  • -2147474649 + 2^32 = (-2147474649 + 4294967296) = 2147492647
  • converting 2147492647 to hex = 80002327

Now, split the hex value of the data_ticket into two-character pairs. Append r_docbase_id (padded to 8 digits) to file_system_path, follow it with the pairs as nested directories (the last pair is the file name), and add the dos_extension. Voilà! You have the complete path to the content file.

C:/Documentum/data/docbase/content_storage_01/00000123/80/00/23/27.txt
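
The decoding steps above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not a Documentum API; the function name content_path is made up, and the docbase ID is simply zero-padded to 8 characters as in the example path.

```python
def content_path(data_ticket, r_docbase_id, file_system_path, dos_extension):
    """Assemble a content file path from the query results (hypothetical helper)."""
    # Masking with 0xFFFFFFFF is equivalent to adding 2**32 to a negative 32-bit value.
    unsigned = data_ticket & 0xFFFFFFFF
    hex_ticket = f"{unsigned:08x}"

    # Split the 8 hex digits into four two-character pairs: 80, 00, 23, 27.
    pairs = [hex_ticket[i:i + 2] for i in range(0, 8, 2)]

    # Pad the docbase ID to 8 characters, as in the example path above.
    docbase_dir = str(r_docbase_id).zfill(8)

    return f"{file_system_path}/{docbase_dir}/{'/'.join(pairs)}.{dos_extension}"


print(content_path(-2147474649, 123,
                   "C:/Documentum/data/docbase/content_storage_01", "txt"))
# C:/Documentum/data/docbase/content_storage_01/00000123/80/00/23/27.txt
```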

I think this is a really clever way to manage the creation and assignment of directories and filenames, don’t you? In addition, this scheme guarantees that there are never more than 256 files in a single directory, which keeps each directory small and quick to search.

You can do it in reverse also. Say you have a file with this path: /80/00/20/23.txt. What is its r_object_id?

  • converting 80002023 to decimal = 2147491875
  • subtract 2^32: 2147491875 - 4294967296 = -2147475421
  • select r_object_id, object_name from dm_sysobject s, dmr_content c where any c.parent_id = s.r_object_id and c.data_ticket = -2147475421.0

Note: You must append “.0” to the data_ticket value to force DQL to process the value as a floating point number; otherwise you get an integer overflow error.
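
The reverse conversion can be sketched the same way. Again, this is only an illustration of the arithmetic; the function name is hypothetical, and it assumes the path fragment contains all four hex pairs of the ticket (for example /80/00/20/23.txt, which is hex 80002023).

```python
def data_ticket_from_path(relative_path):
    """Recover the signed data_ticket from the hex-pair portion of a content path."""
    stem = relative_path.rsplit(".", 1)[0]                   # drop the file extension
    hex_ticket = "".join(p for p in stem.split("/") if p)    # "80002023"
    unsigned = int(hex_ticket, 16)
    # Values at or above 2**31 represent negative 32-bit numbers: subtract 2**32.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned


ticket = data_ticket_from_path("/80/00/20/23.txt")
print(f"{ticket}.0")   # literal to paste into the DQL query, with the .0 suffix
# -2147475421.0
```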

Of course, you can always use the GET_FILE administrative method to find the path to an object’s content file. Just remember that the content ID it asks for is the r_object_id of the dmr_content object.

This blog was originally posted at msroth.wordpress.com on Sept 9, 2011.

Tips for Starting an Enterprise Document Capture Project

October 3rd, 2012 by Lee Grayson

In today’s world, it is critical that businesses and government organizations are able to capture and fully utilize all of the information they have at their disposal, including information found in both paper and electronic documents. This is where intelligent document capture solutions come into play. By integrating an Intelligent Document Capture solution with your existing document management system, you gain the ability to digitize paper documents and utilize them in electronic format within your existing workflows.

However, before an Intelligent Document Capture solution can be successfully implemented, a few steps have to be taken to ensure that the solution is implemented correctly according to each business’s specific requirements and current processes.

Tip #1 – Know Your Business 

(more…)

Part IV – Opening a File in a Documentum Repository through the Adobe FrameMaker Integration

September 25th, 2012 by Jonathan Byars

 Now that the Adobe FrameMaker application has been successfully connected to the Documentum Content Server Repository, it is time to open a project file.  Opening a project file can be accomplished through several methods.  The methods include browsing the repository through the tree view pane, clicking the “File” then “Open” menu items, or clicking the “Open…” link on the Adobe FrameMaker welcome screen.  The latter two methods produce the same results.

Opening a File through FrameMaker Documentum Integration

(more…)

Part III – Connecting to a Documentum Repository

September 25th, 2012 by Jonathan Byars

Once you have successfully completed the steps from Part II to create a connection to the Documentum repository, and the application has restarted, you will need to establish the connection to the Documentum Content Server Repository that was configured. To do this, we will turn to the connection manager.

Click on the “CMS” file menu then select “Connection Manager”.

Select the Connection Manager

(more…)

Part II – Create a Connection to a Documentum Repository

September 25th, 2012 by Jonathan Byars

Once you have completed the steps taken in Part I of this series, you need to create the connection. In order to utilize the integration between the Adobe FrameMaker application and the Documentum Content Server Repository, connection information must be added and configured.  The following information will guide you through adding the appropriate information in order to successfully connect through the Documentum Foundation Services (DFS) Software Development Kit (SDK) interface.

 

Here are the steps:

(more…)

Connecting to a Documentum Repository through Adobe FrameMaker v.10 – Part I

September 25th, 2012 by Jonathan Byars

With the release of Adobe FrameMaker version 10 comes the ability to connect directly to a Content Management System (CMS). This enables users to perform search, read, write, delete, update, check-in and check-out operations on configured repositories. One of the CMS solutions that FrameMaker can be connected to is the Documentum Content Server Repository. This connection is completed through the Documentum Foundation Services (DFS) and the Documentum Foundation Services Software Development Kit (DFS SDK). In this blog series, I will walk through exactly how to configure and use this connection to directly access Adobe FrameMaker project files from within the Documentum Content Server Repository.

Part I: Testing the Documentum Foundation Services (DFS) availability

Before starting the process of connecting the Adobe FrameMaker application to the Documentum Content Server Repository, a few initial steps need to be taken. The following information will need to be gathered from the Documentum System Administrator in order to successfully connect. They are:

(more…)

Two Approaches to Source Code Management (SCM)

June 11th, 2012 by Scott Roth

I’ve been thinking about source code management, because I recently encountered two very different approaches to it. Let me begin with a quick overview of some key source code management terms and concepts. First, what do I mean by source code management (SCM)? SCM is the art and science of controlling the source code for a project/product. Control includes concepts like: allowing developers simultaneous access to code so they don’t conflict with each other while innovating; allowing maintenance for current releases to occur while innovation continues on the main project/product line; having instant access to a version of delivered source code; keeping a detailed history of all changes made; and providing rollback capabilities should changes need to be backed out [Obviously, this is my layman’s definition. Here is a better definition and a pretty good summary of SCM: http://www.cio.com/article/120802/Source_Code_Management_Systems_Trends_Analysis_and_Best_Features].

SCM usually involves an SCM system like Subversion or Git to enable versioning, locking and general control of the source code files. SCM systems provide two very important concepts that enable good management of source code: tags and branches. Every SCM system implements these concepts differently, but in general, they work like this: a tag is a bookmark in the codeline, like “release 1.0” or “beta release 3.3”. These bookmarks allow developers to quickly and easily retrieve a specific version of the entire project/product from the SCM system. A branch is simply a copy of the main codeline (usually referred to as the trunk) that is set aside for independent maintenance or development.

I mentioned above that I recently encountered two different approaches to SCM. The first approach I view as the more “traditional”: all work is done on the main (trunk) codeline, and tags, branches, and code merges are used to manage releases and bug fixes. Figure 1 depicts this approach graphically.

[Figure 1: Traditional approach to source code management]

(more…)

Copyright © 2002–2011, Armedia. All Rights Reserved.