At my current employer, FusionAuth, we have extracted all the user-facing messages to properties files. These files are maintained by the community and cover over fifteen languages.
We maintain the English language version. Whenever new user-facing messages are added, that properties file is updated. Sometimes the community-contributed messages files fall out of date.
In addition, there are a number of common languages that we simply haven’t had a community member offer a translation for.
These include:
- Korean (80M speakers)
- Hindi (691M)
- Punjabi (113M)
- Greek (13.5M)
- Many others
(All numbers from Wikipedia.)
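For context, these messages files are plain key=value .properties files, with # marking comment lines. Here's a made-up fragment in that format; the keys mirror the sample output shown later in this post, and the English values are my own illustration:

# comment lines are ignored
[blank]=Required
[blocked]=Not allowed
[confirm]=Confirm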
While I have some doubts and concerns about AI, I have been using ChatGPT for personal projects and thought it would be interesting to use OpenAI APIs to automate translation of these properties files.
I threw together some Ruby code using ruby-openai, the community Ruby OpenAI library that had been updated most recently.
I also used ChatGPT for a couple of programming queries (“how do I load a properties file into a ruby hash”) because, in for a penny, in for a pound.
The program
Here’s the result:
require "openai"
key = "...KEY..."
client = OpenAI::Client.new(access_token: key)
def properties_to_hash(file_path)
properties = {}
File.open(file_path, "r") do |f|
f.each_line do |line|
line = line.strip
next if line.empty? || line.start_with?("#")
key, value = line.split("=", 2)
properties[key] = value
end
end
properties
end
def hash_to_properties(hash, file_path)
File.open(file_path, "w") do |file|
hash.each do |key, value|
file.write("#{key}=#{value}\n")
end
end
end
def build_translation(properties_in, properties_out, errkeys, language, client)
properties_in.each do |key, value|
sleep 1
# puts "# translating #{key}"
message = value
content = "Translate the message '#{message}' into #{language}"
response = client.chat(
parameters: {
model: "gpt-3.5-turbo", # Required.
messages: [{ role: "user", content: content}], # Required.
temperature: 0.7,
}
)
if not response["error"].nil?
errkeys << key #puts response
end
if response["error"].nil?
translated_val = response.dig("choices", 0, "message", "content")
properties_out[key] = translated_val
puts "#{key}=#{translated_val}"
end
end
end
# start the actual translation
file_path = "messages.properties"
properties = properties_to_hash(file_path)
#puts properties.inspect
properties_hi = {}
language = "Hindi"
errkeys = []
build_translation(properties, properties_hi, errkeys, language, client)
puts "# errkeys has length: " + errkeys.length.to_s
while errkeys.length > 0
# retry again with keys that errored before
newprops = {}
errkeys.each do |key|
newprops[key] = properties[key]
end
# reset errkeys
errkeys = []
build_translation(newprops, properties_hi, errkeys, language, client)
# puts "# errkeys has length: " + errkeys.length.to_s
end
# save file
hash_to_properties(properties_hi, "messages_hi.properties")
More about the program
This script translates 482 English messages into a different language. It takes about 28 minutes to run; roughly 8 minutes of that is the sleep statement, of which more below. To run this, I signed up for an OpenAI key and a paid plan. The total cost was about $0.02.
I tested it with two languages, French and Hindi. I used French because we have a community provided French translation. Therefore, I was able to spot check messages against that. There was a lot of overlap and similarity. I also used Google Translate to check where they differed, and GPT seemed to be more in keeping with the English than the community translation.
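If you wanted to automate that spot check, a few more lines of Ruby reusing properties_to_hash would diff the two files. This is a rough sketch; the file names are assumptions, not what I actually used:

# Sketch: compare a GPT-generated French file against the community-provided one.
# Both file names here are assumptions; adjust to wherever the files live.
generated = properties_to_hash("messages_fr_generated.properties")
community = properties_to_hash("messages_fr.properties")

generated.each do |key, value|
  next unless community.key?(key)
  next if community[key] == value
  puts "#{key} differs:"
  puts "  generated: #{value}"
  puts "  community: #{community[key]}"
end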
I can definitely see places to improve this script. For one, I could augment it with a loop over different languages, letting me support five or ten more languages with one execution, as sketched below. I also had the messages file present in my current directory, but using Ruby to retrieve it from GitHub, or running this code inside the cloned project, would be easy.
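That multi-language loop could wrap the existing translation flow; here's a rough sketch, where the language list and file suffixes are my own illustration rather than anything from the original script:

# Sketch: run the existing translation flow over several languages in one go.
targets = { "Hindi" => "hi", "Korean" => "ko", "Greek" => "el" }

targets.each do |language, suffix|
  translated = {}
  errkeys = []
  build_translation(properties, translated, errkeys, language, client)
  # retry keys that errored, as in the single-language version
  while errkeys.length > 0
    newprops = errkeys.to_h { |key| [key, properties[key]] }
    errkeys = []
    build_translation(newprops, translated, errkeys, language, client)
  end
  hash_to_properties(translated, "messages_#{suffix}.properties")
end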
The output occasionally needed to be reviewed and edited. Here’s an example:
[blank]=आवश्यक (āvaśyak)
[blocked]=अनुमति नहीं है (Anumati nahi hai)
[confirm]=पुष्टि करें (Pushṭi karen)
Now, I’m no expert on Hindi, but I believe I should remove the romanized transliterations in parentheses above. One option would be to exclude certain keys or to refine the prompt I provided. Another would be to find someone who knows Hindi to review the output.
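If I went the prompt-refinement route, the change would be a single line inside build_translation. Something like the following might work, though I haven't verified that GPT-3.5 honors it consistently:

# Sketch: a tighter prompt that asks for the translation only, with no transliteration.
content = "Translate the message '#{message}' into #{language}. " \
          "Return only the translated text, with no transliteration or explanation."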
About that sleep call. I built it in because in my initial attempt I saw error messages from the OpenAI API and wanted to slow down my requests so as not to trigger them. I didn’t dig too deeply into the reason for the exception below; at first glance it appears to be a networking issue.
C:/Ruby31-x64/lib/ruby/3.1.0/net/protocol.rb:219:in `rbuf_fill': Net::ReadTimeout with #<TCPSocket:(closed)> (Net::ReadTimeout)
from C:/Ruby31-x64/lib/ruby/3.1.0/net/protocol.rb:193:in `readuntil'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/protocol.rb:203:in `readline'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http/response.rb:42:in `read_status_line'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http/response.rb:31:in `read_new'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1609:in `block in transport_request'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1600:in `catch'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1600:in `transport_request'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1573:in `request'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1566:in `block in request'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:985:in `start'
from C:/Ruby31-x64/lib/ruby/3.1.0/net/http.rb:1564:in `request'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/httparty-0.21.0/lib/httparty/request.rb:156:in `perform'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/httparty-0.21.0/lib/httparty.rb:612:in `perform_request'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/httparty-0.21.0/lib/httparty.rb:542:in `post'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/httparty-0.21.0/lib/httparty.rb:649:in `post'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/ruby-openai-3.7.0/lib/openai/client.rb:63:in `json_post'
from C:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/ruby-openai-3.7.0/lib/openai/client.rb:11:in `chat'
from translate.rb:33:in `block in build_translation'
from translate.rb:28:in `each'
from translate.rb:28:in `build_translation'
from translate.rb:60:in `
(Yes, I’m on Windows, don’t hate.)
Given this was a quick and dirty program, I added the sleep call, but later added the while errkeys.length > 0 loop, which should help recover from any network issues. I’ll probably remove the sleep in the future.
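One way to drop the sleep would be to rescue the timeout around the chat call and let the existing errkeys retry loop pick up the failed key. A sketch of what that could look like inside build_translation (untested, and replacing the current client.chat call):

# Sketch: treat a read timeout like an API error so the errkeys retry loop handles it.
begin
  response = client.chat(
    parameters: {
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: content }],
      temperature: 0.7,
    }
  )
rescue Net::ReadTimeout
  errkeys << key
  next
end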
I signed up for a paid account because I was receiving “quota exceeded” messages. To their credit, they have some great billing features. I was able to limit my monthly spend to $10, an amount I feel comfortable with.
As I mentioned above, translating every message into Hindi using GPT-3.5 cost about $0.02. Well worth it.
I used GPT-3.5 because GPT-4 was only in beta when I wrote this code. I didn’t spend too much time mulling that over, but it would be interesting to see if GPT-4 is materially better at this task.
Worries
Translating these messages was a great exploration of the power of the OpenAI API, but I think it was also a great illustration of this tweet.
I had to determine what the problem was, how to get the data into the model, and how to pull the results back out. As Reid Hoffman says in Impromptu, GPT was a great undergraduate assistant, but no professor.
Could I have dumped the entire properties file into ChatGPT and asked for a translation? I tried a couple of times and it timed out. When I shortened the number of messages, I was unable to figure out how to get it to ignore comments in the file.
One of my other worries is around licensing. I’m not alone. This is prototype code running on my personal laptop, and the license for all the localization properties files is Apache 2.0. But even with that, I’m not sure my company would integrate this process given the unknown legal ramifications of using OpenAI GPT models.
In conclusion
OpenAI APIs expose large language models and make them easy to integrate into your application. They are a super powerful tool, but I’m not sure where they fit into the legal landscape. Where have we heard that before?
Definitely worth exploring more.