Today I noticed that the OpenSER plugin I wrote some time ago was causing a major problem in a time-critical application running on i5/OS on an IBM iSeries (formerly known as AS/400). The application sporadically segfaulted under what we thought were random conditions, but after some log mining we found out that the process was being halted because it had too many open sockets.
The normal time span for each transaction never exceeds 2 seconds in our typical scenario. This led us to suspect a bug in the listener thread, some sort of infinite loop breaking the program, but after reading the socket client source file I figured out that it was actually a bug in my code.
I’m not a TCP expert, but I will try to explain the basics of how TCP connections are closed to better illustrate the problem.
- Once the connection is established there is no distinction between server and client. Either party can close the connection whenever it wants.
- Call the two parties A and B. When B closes the connection, it sends a FIN packet (a TCP segment with the FIN flag set) to A and enters the FIN-WAIT-1 state.
- A replies with an ACK and enters CLOSE-WAIT. When B receives this ACK, it enters FIN-WAIT-2. Now B expects a FIN from A.
- A sends its own FIN, which only happens when the application on A closes its socket, to finally confirm the disconnection. After the final ACK from B, the socket is closed.
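To make the sequence concrete, here is a minimal sketch of what A sees when B closes first. It is not part of the plugin: the function name `peer_close_seen_as_eof` is made up for this example, and it assumes a POSIX system with a working loopback interface. B's FIN shows up on A as a `recv` that returns 0, and A's socket stays in CLOSE-WAIT until A calls `close` itself.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if A observes B's FIN as end-of-file,
 * 0 if it does not, and -1 if the loopback connection cannot be set up. */
int peer_close_seen_as_eof(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    char byte;
    int listener, a = -1, b = -1, result = -1;

    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* B connects; A is the accepted end of the same connection. */
    b = socket(AF_INET, SOCK_STREAM, 0);
    if (b < 0 || connect(b, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    a = accept(listener, NULL, NULL);
    if (a < 0)
        goto out;

    /* B closes first: it sends FIN and moves through FIN-WAIT-1/FIN-WAIT-2. */
    close(b);
    b = -1;

    /* On A, the FIN is reported as recv() returning 0 (end-of-file). */
    result = (recv(a, &byte, 1, 0) == 0);

out:
    /* A must close explicitly; until then its socket sits in CLOSE-WAIT. */
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    close(listener);
    return result;
}
```

If the `close(a)` at the end were skipped, B's side would be left waiting in FIN-WAIT-2 for a FIN that never comes.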
The problem?
Under certain conditions I wasn’t closing the sockets explicitly, which left each connection half-closed: FIN-WAIT-2 on the i5 side and CLOSE_WAIT on the Linux side. There is no standard timeout for reclaiming sockets in FIN-WAIT-2; i5/OS uses 10 minutes, which is far too long for the unused sockets to be recycled in time. When the count reached 2000, the program inevitably crashed.
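One way to spot this kind of leak from the client side is to count the descriptors the process holds before and after a call. The helper below is a hypothetical diagnostic, not something taken from the plugin: `count_open_fds` and its `limit` parameter are names made up for this sketch, and it assumes a POSIX system where `fcntl(fd, F_GETFD)` fails for a descriptor that is not open.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical diagnostic: counts how many descriptors in the range
 * [0, limit) are currently open in this process. */
int count_open_fds(int limit)
{
    int fd, open_count = 0;

    for (fd = 0; fd < limit; fd++) {
        /* F_GETFD only succeeds on descriptors that are open. */
        if (fcntl(fd, F_GETFD) != -1)
            open_count++;
    }
    return open_count;
}
```

Calling it before and after a failing request would show the count growing by one per call whenever some error path returns without `close`.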
The fix
[c]
#include <errno.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <strings.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

BOOL send_message(struct _cnxaaa_net_config *net_config, const str *content, str *reply_buffer, unsigned int timeout)
{
    int sockfd, n;
    struct sockaddr_in serv_addr;
    struct hostent *server;
    struct timeval tv;

    // timeout is treated as microseconds; split it so tv_usec stays below one second
    tv.tv_sec = timeout / 1000000;
    tv.tv_usec = timeout % 1000000;

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0)
    {
        _CNXAAA_LOG_ERR("Error opening socket");
        return FALSE;
    }
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(struct timeval)) != 0)
    {
        _CNXAAA_LOG_ERR("Error assigning socket option");
        close(sockfd); // this path also needs a close, or the descriptor leaks
        return FALSE;
    }
    server = gethostbyname(net_config->ip);
    if (server == NULL)
    {
        _CNXAAA_LOG_ERR("No such host");
        close(sockfd); // this path also needs a close, or the descriptor leaks
        return FALSE;
    }
    bzero((char *) &serv_addr, sizeof(serv_addr));
    serv_addr.sin_family = AF_INET;
    bcopy((char *)server->h_addr, (char *)&serv_addr.sin_addr.s_addr, server->h_length);
    serv_addr.sin_port = htons(net_config->port);
    if (connect(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0)
    {
        _CNXAAA_FLOG_ERR("Error connecting: %s", strerror(errno));
        close(sockfd); // I wasn’t closing here
        return FALSE;
    }
    n = write(sockfd, content->s, content->len);
    if (n < 0)
    {
        _CNXAAA_FLOG_ERR("CNXAAA ERROR | Error writing to socket: %s", strerror(errno));
        close(sockfd); // I wasn’t closing here
        return FALSE;
    }
    n = recv(sockfd, (void *) reply_buffer->s, reply_buffer->len, 0);
    if (n < 0)
    {
        _CNXAAA_FLOG_ERR("CNXAAA ERROR | Error reading from socket: %s", strerror(errno));
        close(sockfd); // I wasn’t closing here
        return FALSE;
    }
reply_buffer->s[n + 1] = ‘