JSTL 1.2 and “Provider org.apache.xerces.jaxp.SAXParserFactoryImpl not found” exception

I am using JSTL for a project. For such a standard, it’s a bit wierd to get started with. Sun provides a spec, but as far as I can see, you have to download the entire J2EE stack to get the jarfile. There’s not a lot of good documentation on getting started with it (this page, though dated and focusing on JSTL 1.0 and 1.1, was helpful).

One exception in particular dogged me. If you are trying to get started using Tomcat (my version is 6.0) and JSTL, and are getting this exception when you are using the JSTL tags, run, don’t walk to your nearest Xerces download center and place xercesImpl.jar, in your lib directory (I also needed Xalan, for JSTL 1.2, which I had to download from here):

HTTP Status 500 -

type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception
javax.servlet.ServletException: javax.xml.parsers.FactoryConfigurationError: Provider org.apache.xerces.jaxp.SAXParserFactoryImpl not found

root cause

javax.xml.parsers.FactoryConfigurationError: Provider org.apache.xerces.jaxp.SAXParserFactoryImpl not found
javax.xml.parsers.SAXParserFactory.newInstance(Unknown Source)

note The full stack trace of the root cause is available in the Apache Tomcat/6.0.14 logs.

Apache Tomcat/6.0.14

How to rotate catalina.out without restarting tomcat

There seems to be some confusion regarding rotating tomcat logs, in particular rotating catalina.out without restarting Tomcat. With Tomcat4, catalina.out contains everything the application developers logged using System.out.println. Because of the way that Tomcat holds onto the file descriptor on Unix, simply moving the file doesn’t work: Tomcat merrily writes to a file that you can no longer access.

This is important to do for the same reasons you should rotate any other system log; disk space is cheap, but logs can fill it quickly, and the information value of logs decreases rapidly with time. Rotating logs typically means you remove them from the filesystem, but you can also archive them off line with tape, or to a SAN, etc. I’ve run into issues with JSVC and large log files too.

There’s an entry in the FAQ and a pointer to a maillist discussion. But you have to dig a bit and I thought I’d just document what I did. (This works for Tomcat4–Tomcat 5.5 did away with the Logger element and I’m guessing you’d just use some of the log4j configuration and RollingFileAppender.)

First off, you want to make sure that you’re rotating your other logs. This is typically done with a Logger component, configured either at the host or context level. Set the timestamp attribute to be ‘true’.

Now Tomcat is handily rotating all your host and context level logs. You may want to gzip or delete old ones via find:
/usr/bin/find ./logs -mtime +60 -name "mylog_log*"|xargs rm -f

The next step is to set the swallowOutput attribute to ‘true’ on all contexts. According to the docs, this attribute causes

the bytes output to System.out and System.err by the web application [to] be redirected to the web application logger.

Now you have to restart tomcat. But from here on out, Tomcat will rotate all configured logs, and catalina.out will only contain start and stop messages.

Oh yeah, and just as that email thread does, I heartily condemn using System.out.println because log4j is so easy to setup and use. But I’ve written code like that (admittedly, usually with a sheepish grin on my face), so I can’t cast the first stone.

Technorati Tags: ,

Destroying robot generated Tomcat sessions

A large effort goes into creating sites that are crawlable by robots, such as Google, Yahoo! and other search engines. However, these programs can create a large number of sessions, if the site is based on servlet technology. Per the servlet spec (the 2.3 specification, page 50), if a client never joins a session, new sessions will be created for each request.

A session is considered new when it is only a prospective session and has not been established. Because HTTP is a request-response based protocol, an HTTP session is considered to be new until a client joins it. A client joins a session when session tracking information has been returned to the server indicating that a session has been established. Until the client joins a session, it cannot be assumed that the next request from the client will be recognized as part of a session.The session is considered to be new if either of the following is true:

  • The client does not yet know about the session
  • The client chooses not to join a session.

These conditions define the situation where the servlet container has no mechanism by which to associate a request with a previous request.

Since all these extra sessions take up memory, and are long lived, a client asked me to look into a way to invalidate them. (I’m not the first person to run into this problem.) The easiest way to do that was to build a filter that examined the User-Agent HTTP header; here’s a nice list of User-Agent values. If the client was any of the robots, we could safely invalidate the session. For some reason, in with Tomcat 4.1, I needed to run session.isNew(); before running session.invalidate();, otherwise the session wasn’t destroyed. The filter was placed at the end of the request chain, as outlined in this article, by calling chain.doFilter(request, response); before the invalidation filter looked at the request or response.

I haven’t seen any performance problems with creating a session and then throwing it away, probably because java is so good at garbage collecting short lived objects. If I did, conditionally disabling session participation in a JSP might be an option to pursue.

JSVC and large log files

jsvc, which is used for daemoning Tomcat and other java applications on unix, takes filenames for stdout and stderr as arguments. One thing to be aware of is that when the either of these files reach a size of just over 2 gigabytes, jsvc simply fails. No error message. If you restart the application, it will note that it can’t write to the file and proceeds to write to the console. I saw this behavior using tomcat 5 on fedora core 4 with jsvc 1.0.1 (described here).

I am not sure exactly what the problem is, but when I started tomcat via the normal shell script, it was able to write to that file. The user that jsvc runs as had no limits on file size:

-bash-3.00$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32764
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Therefore, it might be an issue with jsvc. Do note that there are nightly snapshots of jsvc, which might solve the issue. The solution I found is to use the copytruncate option of logrotate.

Options for connecting Tomcat and Apache

Many of the java web applications I’ve worked on run in the Tomcat servlet engine, fronted by an Apache web server. Valid reasons for wanting to run Apache in front of Tomcat are numerous and include increased clickstream statistics, Apache’s ability to quickly and efficiently serve static content such as images, the ability to host other dynamic solutions like mod_perl and PHP, and Apache’s support for SSL certificates. This last is especially important–any site with sensitive data (credit card information, for example) will usually have that data encrypted in transit, and SSL is the default manner in which to do so.

There are a number of different ways to deal with the Tomcat-Apache connection, in light of the concerns mentioned above:

Don’t deal with the connection at all. Run Tomcat alone, responding on the typical http and https ports. This has some benefits; configuration is simpler and fewer software interfaces tends to mean fewer bugs. However, while the documentation on setting up Tomcat to respond to SSL traffic is adequate, Apache handling SSL is, in my experience, far more common. For better or worse, Apache is seen as faster, especially when when confronted with numeric challenges like encryption. Also, as of Jan 2005, Apache serves 70% of websites while Tomcat does not serve an appreciable amount of http traffic. If you’re willing to pay, Netcraft has an SSL survey which might better illuminate the differences in SSL servers.

If, on the other hand, you choose to run some version of the Apache/Tomcat architecture, there are a few different options. mod_proxy, mod_proxy with mod_rewrite, and mod_jk all give you a way to manage the Tomcat-Apache connection.

mod_proxy, as its name suggests, proxies http traffic back and forth between Apache and Tomcat. It’s easy to install, set up and understand. However, if you use this method, Apache will decrypt all SSL data and proxy it over http to Tomcat. (there may be a way to proxy SSL traffic to a different Tomcat port using mod_proxy–if so, I was unable to find the method.) That’s fine if they’re both running on the same box or in the same DMZ, the typical scenario. A byproduct of this method is that Tomcat has no means of knowing whether a particular request came in via secure or insecure means. If using a tool like the Struts SSL Extension, this can be an issue, since Tomcat needs such information to decide whether redirection is required. In addition, if any of the dynamic generation in Tomcat creates absolute links, issues may arise: Tomcat receives requests for localhost or some other hidden hostname (via request.getServerName()), rather than the request for the public host, whichApache has proxied, and may generate incorrect links.

Updated 1/16: You can pass through secure connections by placing the proxy directives in certain virtual hosts:

<VirtualHost _default_:80>
ProxyPass /tomcatapp http://localhost:8000/tomcatapp
ProxyPassReverse /tomcatapp http://localhost:8000/tomcatapp

<VirtualHost _default_:443>

SSLProxyEngine On
ProxyPass /tomcatapp https://localhost:8443/tomcatapp
ProxyPassReverse /tomcatapp https://localhost:8443/tomcatapp

This doesn’t, however, address the getServerName issue.

Updated 1/17:

Looks like the Tomcat Proxy Howto can help you deal with the getServerName issue as well.

Another option is to run mod_proxy with mod_rewrite. Especially if the secure and insecure parts of the dynamic application are easily separable (for example, if the application was split into /secure/ and /normal/ chunks), mod_rewrite can be used to rewrite the links. If a user visits this url: https://www.example.com/application/secure and traverses a link to /application/normal, mod_rewrite can send them to http://www.example.com/application/normal/, thus sparing the server from the strain of serving pages needlessly encrypted.

mod_jk is the usual way to connect Apache and Tomcat. In this case, Tomcat listens on a different port and a piece of software known as a connector enables Apache to send the requests to Tomcat with more information than is possible with a simple proxy. For instance, certain variables are sent via the connector when Apache receives an SSL request. This allows Tomcat full knowledge of the state of the request, and makes using a tool like the aforementioned Struts SSL Extension possible. The documentation is good. However using mod_jk is not always the best choice; I’ve seen some performance issues with some versions of the software. You almost always have to build it yourself: binary releases of mod_jk are few and far between, I’ve rarely found the appropriate version for my version of Apache, and building mod_jk is confusing. (Even though mod_jk 1.2.8 provides an ant script, I ended up using the old ‘configure/make/make install’ process because I couldn’t make the ant script work.)

In short, there are plenty of options for connecting Tomcat and Apache. In general, I’d start out using mod_jk, simply because that’s the option that was built specifically to connect the two; mod_proxy doesn’t provide quite the same level of integration.

© Moore Consulting, 2003-2017 +