I'm currently building a gem for converting curl commands to something else, and one early problem I needed to deal with was parsing the curl command arguments.
The basic idea for a gem is to convert curl command to a sweet looking hash that we can use to do magical things. So when given:
curl https://isrubydead.com/
Our program converts it to:
{
"body": "",
"method": "GET",
"headers": {},
"url": "https://isrubydead.com/"
}
We'll call this hash a Request
, so let's define it:
Request = Struct.new(:body, :method, :headers, :url) do
def initialize
self.method = 'GET'
self.headers = {}
self.url = ''
self.body = ''
end
end
Our program should expect the curl command as a command line
argument, so let's use ARGV.first
to read the curl
command:
require 'shellwords'
module ParseCurl
class << self
def parse(command)
self.request = Request.new
self.parser = Parser.new(request)
parser.parse(Shellwords.split(command.strip))
request.to_h
end
private
attr_accessor :request
attr_accessor :parser
end
end
ParseCurl.parse(ARGV.first)
We use Shellwords
here to convert the curl command
to an array of tokens, to make it easier to parse.
Now that we have a place to store the result of parsing the curl command, and we are passing the command line argument to a parser, it's time to write an actual parser.
Being a very serious and very senior developer, I've ended up with this very appealing class:
class Parser
def initialize(request)
@request = request
end
def parse(words)
self.words = words
while (word = words.shift)
parse_word(word)
end
end
private
attr_accessor :words
attr_reader :request
def parse_word(word)
case word
when URI::DEFAULT_PARSER.make_regexp
request.url = word
when /\A(-H|--header)\Z/
parse_header
when /\A--compressed\Z/
request.headers['Accept-Encoding'] ||= 'deflate, gzip'
when /\A-X|\A--request\Z/
request.method = extract_method(word)
when /\A(-u|--user)\Z/
request.headers['Authorization'] ||=
"Basic #{Base64.strict_encode64(words.shift)}"
when /\A(-b|--cookie)\Z/
request.headers['Set-Cookie'] ||= words.shift
when /\A(-A|--user-agent)\Z/
request.headers['User-Agent'] ||= words.shift
when /\A(-I|--head)\Z/
request.method = 'HEAD'
when /\A(-d|--data|--data-ascii)\Z/
parse_data
when /\A--data-binary\Z/
parse_data(strip_newlines: false)
when /\A--data-raw\Z/
parse_data(raw: true)
when /\A--data-urlencode\Z/
parse_data(url_encode: true)
when /\./
request.url = "http://#{word}" if request.url == ''
end
end
end
I know, a thing a beauty.
What's also very nice is that this class also have to know everything about parsing curl arguments; from header and method arguments to various data arguments.
Super nice! Just look at these beautiful methods, that all belong to this class:
class Parser
private
def parse_data(**args)
request.method = 'POST' if request.method == 'GET' ||
request.method == 'HEAD'
request.headers['Content-Type'] ||=
'application/x-www-form-urlencoded'
data = extract_data(args)
request.body = if request.body.empty?
data
else
"#{request.body}&#{data}"
end
end
def parse_header
header = extract_header(words.shift)
return unless header
request.headers[header.keys.first] =
header.values.first
end
def extract_data(args)
data = words.shift
data = read_file(data, args) if data =~ /\A@/ || args[:raw]
data
end
def read_file(word,
strip_newlines: true,
raw: false,
url_encode: false)
path = raw ? word : word[1..-1]
content = File.read(path)
content.gsub!(/\n|\r/, '') if strip_newlines
content = CGI.escape(content) if url_encode
content
rescue Errno::ENOENT
file_path = File.expand_path(data[1..-1])
raise FileNotFoundError,
"Cannot read file: #{file_path}. Does it exist?"
end
# Needed for extracting -XPUT and similar formats.
def extract_method(word)
return words.shift unless word =~ /^-X/
if !(method = word.match(/^-X(.*)/)[1]).empty?
method
else
words.shift
end
end
def extract_header(header)
[header.split(/: (.+)/)].to_h
rescue StandardError
nil
end
end
Marvelous! This class knows absolutely everything about curl and all the arguments and options. We could ask it the curl's shoe size, and it would give us the answer.
Now that we have thought this class so much, and it has achieved the master's degree worth of education in the University of curl, I'm a little worried it's going to take over the world.
Sure, it only knows about curl for now, but what happens when someone decides it should know about, I don't know, some scary AI thing. It looks innocent right now, but then you turn around and it gets interested in doors.
I don't trust it one bit, so let's steal some knowledge from this class and let's spread it to other classes.
I'll now start explaining the refactoring step by step, but you can take a look at the final result if you'd like to skip.
This class will from now on be ignorant of curl arguments, so the humongous case statement becomes:
class Parser
def parse_word(word)
Word.parse(request, words, word)
end
end
Since every scenario we have to cover in the case statement consists of the regex check and the parsing part, the idea is to abstract those into class.
Let's introduce the Word::Base
class:
module Word
class Base
# Used to check if the word given matches the subclass' regex.
def self.===(word)
word =~ regex
end
def self.regex
raise NotImplementedError, "#{name} doesn't have .regex defined"
end
def parse(*)
raise NotImplementedError, "#{self.class.name} doesn't have #parse defined"
end
end
end
The plan is to have one base class, that will be inherited for every scenario we'd like to cover. Here's the first example of one such class:
module Word
class Header < Base
def self.regex
/\A(-H|--header)\Z/
end
def parse(*)
header = extract_header(words.shift)
return unless header
request.headers[header.keys.first] = header.values.first
end
private
def extract_header(header)
[header.split(/: (.+)/)].to_h
rescue StandardError
nil
end
end
end
This class stores information about everything that's needed to
parse a header argument. It detects the argument (when the curl command includes
-H
or --header
in arguments) and then changes the request hash,
which we return in the end.
Since this class has to know about words
and request
too, we'll add that to the base class. We expect other descendants
to need that too:
module Word
class Base
def initialize(words, request)
@words = words
@request = request
end
private
attr_reader :words
attr_reader :request
end
end
The next problem we have is that there's no connection between
the original parse_word
method and this new class we have created,
so let's write a method that connects those two:
module Word
def self.parse(request, words, word)
subclass = Word::Base.subclasses
.detect { |klass| klass === word }
subclass&.new(words, request)&.parse(word)
end
class Base
# List of subclasses, first-inherited class first.
def self.subclasses
ObjectSpace.each_object(Base.singleton_class)
.reject { |klass| klass == Base }.reverse
end
end
end
With this method we can find a Word::Base
subclass that matches the word
that was given and then parses that word. If we find the subclass,
we try to parse the word. So our current code works
for parsing --header
or -h
words from curl.
Our code now supports parsing headers, so let's proceed to add classes for every argument we'd like to support:
module Word
class Compressed < Base
def self.regex
/\A--compressed\Z/
end
def parse(*)
request.headers['Accept-Encoding'] ||= 'deflate, gzip'
end
end
class Request < Base
def self.regex
/\A-X|\A--request\Z/
end
def parse(word)
request.method = extract_method(word)
end
private
# Needed for extracting -XPUT and similar formats.
def extract_method(word)
return words.shift unless word =~ /^-X/
if !(method = word.match(/^-X(.*)/)[1]).empty?
method
else
words.shift
end
end
end
class Authorization < Base
def self.regex
/\A(-u|--user)\Z/
end
def parse(*)
request.headers['Authorization'] ||=
"Basic #{Base64.strict_encode64(words.shift)}"
end
end
class Cookie < Base
def self.regex
/\A(-b|--cookie)\Z/
end
def parse(*)
request.headers['Set-Cookie'] ||= words.shift
end
end
class UserAgent < Base
def self.regex
/\A(-A|--user-agent)\Z/
end
def parse(*)
request.headers['User-Agent'] ||= words.shift
end
end
class Head < Base
def self.regex
/\A(-I|--head)\Z/
end
def parse(*)
request.method = 'HEAD'
end
end
class Data < Base
def self.regex
/\A(-d|--data|--data-ascii)\Z/
end
def options
{}
end
def parse(*)
request.method = 'POST' if request.method == 'GET' ||
request.method == 'HEAD'
request.headers['Content-Type'] ||=
'application/x-www-form-urlencoded'
data = extract_data(options)
request.body = if request.body.empty?
data
else
"#{request.body}&#{data}"
end
end
def extract_data(args)
data = words.shift
data = read_file(data, args) if data =~ /\A@/ ||
args[:raw]
data
end
def read_file(word,
strip_newlines: true,
raw: false,
url_encode: false)
path = raw ? word : word[1..-1]
content = File.read(path)
content.gsub!(/\n|\r/, '') if strip_newlines
content = CGI.escape(content) if url_encode
content
rescue Errno::ENOENT
file_path = File.expand_path(data[1..-1])
raise FileNotFoundError,
"Cannot read file: #{file_path}. Does it exist?"
end
end
class StripNewlinesData < Data
def self.regex
/\A--data-binary\Z/
end
def options
{ strip_newlines: false }
end
end
class RawData < Data
def self.regex
/\A--data-raw\Z/
end
def options
{ raw: true }
end
end
class URLEncodeData < Data
def self.regex
/\A--data-urlencode\Z/
end
def options
{ url_encode: true }
end
end
class URL < Base
def self.===(word)
word =~ URI::DEFAULT_PARSER.make_regexp
end
def parse(word)
request.url = word
end
end
class NonStrictURL < Base
def self.regex
/\./
end
def parse(word)
request.url = "http://#{word}" if request.url == ''
end
end
end
Here's the final code we end up with:
module Word
def self.parse(request, words, word)
subclass = Word::Base.subclasses
.detect { |klass| klass === word }
subclass&.new(words, request)&.parse(word)
end
class Base
def initialize(words, request)
@words = words
@request = request
end
# List of subclasses, first-inherited class appears first.
def self.subclasses
ObjectSpace.each_object(Base.singleton_class)
.reject { |klass| klass == Base }.reverse
end
# Used to check if the word given matches the subclass' regex.
def self.===(word)
word =~ regex
end
def self.regex
raise NotImplementedError,
"#{name} doesn't have .regex defined"
end
def parse(*)
raise NotImplementedError,
"#{self.class.name} doesn't have #parse defined"
end
private
attr_reader :words
attr_reader :request
end
class Header < Base
def self.regex
/\A(-H|--header)\Z/
end
def parse(*)
header = extract_header(words.shift)
return unless header
request.headers[header.keys.first] = header.values.first
end
private
def extract_header(header)
[header.split(/: (.+)/)].to_h
rescue StandardError
nil
end
end
class Compressed < Base
def self.regex
/\A--compressed\Z/
end
def parse(*)
request.headers['Accept-Encoding'] ||= 'deflate, gzip'
end
end
class Request < Base
def self.regex
/\A-X|\A--request\Z/
end
def parse(word)
request.method = extract_method(word)
end
private
# Needed for extracting -XPUT and similar formats.
def extract_method(word)
return words.shift unless word =~ /^-X/
if !(method = word.match(/^-X(.*)/)[1]).empty?
method
else
words.shift
end
end
end
class Authorization < Base
def self.regex
/\A(-u|--user)\Z/
end
def parse(*)
request.headers['Authorization'] ||=
"Basic #{Base64.strict_encode64(words.shift)}"
end
end
class Cookie < Base
def self.regex
/\A(-b|--cookie)\Z/
end
def parse(*)
request.headers['Set-Cookie'] ||= words.shift
end
end
class UserAgent < Base
def self.regex
/\A(-A|--user-agent)\Z/
end
def parse(*)
request.headers['User-Agent'] ||= words.shift
end
end
class Head < Base
def self.regex
/\A(-I|--head)\Z/
end
def parse(*)
request.method = 'HEAD'
end
end
class Data < Base
def self.regex
/\A(-d|--data|--data-ascii)\Z/
end
def options
{}
end
def parse(*)
request.method = 'POST' if request.method == 'GET' ||
request.method == 'HEAD'
request.headers['Content-Type'] ||=
'application/x-www-form-urlencoded'
data = extract_data(options)
request.body = if request.body.empty?
data
else
"#{request.body}&#{data}"
end
end
def extract_data(args)
data = words.shift
data = read_file(data, args) if data =~ /\A@/ ||
args[:raw]
data
end
def read_file(word,
strip_newlines: true,
raw: false,
url_encode: false)
path = raw ? word : word[1..-1]
content = File.read(path)
content.gsub!(/\n|\r/, '') if strip_newlines
content = CGI.escape(content) if url_encode
content
rescue Errno::ENOENT
file_path = File.expand_path(data[1..-1])
raise FileNotFoundError,
"Cannot read file: #{file_path}. Does it exist?"
end
end
class StripNewlinesData < Data
def self.regex
/\A--data-binary\Z/
end
def options
{ strip_newlines: false }
end
end
class RawData < Data
def self.regex
/\A--data-raw\Z/
end
def options
{ raw: true }
end
end
class URLEncodeData < Data
def self.regex
/\A--data-urlencode\Z/
end
def options
{ url_encode: true }
end
end
class URL < Base
def self.===(word)
word =~ URI::DEFAULT_PARSER.make_regexp
end
def parse(word)
request.url = word
end
end
class NonStrictURL < Base
def self.regex
/\./
end
def parse(word)
request.url = "http://#{word}" if request.url == ''
end
end
end
class Parser
def initialize(request)
@request = request
end
def parse(words)
self.words = words
while (word = words.shift)
parse_word(word)
end
end
private
attr_accessor :words
attr_reader :request
def parse_word(word)
Word.parse(request, words, word)
end
end
module ParseCurl
class << self
def parse(command)
self.request = Request.new
self.parser = Parser.new(request)
parser.parse(Shellwords.split(command.strip))
request.to_h
end
private
attr_accessor :request
attr_accessor :parser
end
end
ParseCurl.parse(ARGV.first)
Note how each class contains knowledge specific to the argument we're trying to cover with it and nothing else. I think that's an improvement since it's still easy to support new arguments, and we don't pollute a single class with many methods.