Quick Tip #7: Texas and Unicode things

[“Texas” as a cute name has been changed to the less offensive and more prosaic “ASCII”]

Perl 6 is more than Unicode “aware”—it reaches into the Unicode Character Database to use appropriate and meaningful characters. There use to be a joke that the only thing that kept Perl 5 from expanding was the lack of unused punctuation keys. That’s not a problem anymore.

But, for most Unicode thingys, there is an ASCII version. That version is probably multiple characters, so it’s larger than the Unicode version. And, since it’s a larger, super-sized version, we’ll call it the “Texas” version. The Perl 6 docs map the Unicode to ASCII versions.

I wrote a small command-line program to convert between the two. It’s not very sophisticated and I plan on improving it later. I’d especially like to look up things based on what they do. For instance, search for “subset” and get subset operators. Another data column with a description would be nice. And, reading all this from a file. Although I’ve taken the data from a single page in the Perl 6 docs, there are many other things I could add (such as the quoting stuff). I’ve also saved this in my unicode2texas.p6 gist.

use v6;

# https://docs.perl6.org/language/unicode_texas#Other_acceptable_single_codepoints
# https://gist.github.com/briandfoy/619638701f38817387f3e5d22d6dab12

my %hash;

BEGIN {  # Phasers on!
my @table = qw{
	«	U+00AB	<<
	»	U+00BB	>>
	×	U+00D7	*
	÷	U+00F7	/
	−	U+2212	-
	∘	U+2218	o
	≅	U+2245	=~=
	π	U+03C0	pi
	τ	U+03C4	tau
		U+1D452	e
	∞	U+221E	Inf
	…	U+2026	...
	‘	U+2018	'
	’	U+2019	'
	‚	U+201A	'
	“	U+201C	"
	”	U+201D	"
	„	U+201E	"
	「	U+FF62	Q//
	」	U+FF63	Q//
	⁺	U+207A	+
	⁻	U+207B	-
	¯	U+00AF	-
	⁰	U+2070	**0
	¹	U+2071	**1
	²	U+2072	**2
	³	U+2073	**3
	⁴	U+2074	**4
	⁵	U+2075	**5
	⁶	U+2076	**6
	⁷	U+2077	**7
	⁸	U+2078	**8
	⁹	U+2079	**9
	∘	U+2218	o
	∅	U+2205	set()
	∈	U+2208	(elem)
	∉	U+2209	!(elem)
	∋	U+220B	(cont)
	∌	U+220C	!(cont)
	⊆	U+2286	(<=)
	⊈	U+2288	!(<=)
	⊂	U+2282	(<)
	⊄	U+2284	!(<)
	⊇	U+2287	(>=)
	⊉	U+2289	!(>=)
	⊃	U+2283	(>)
	⊅	U+2285	!(>)
	≼	U+227C	(<+)
	≽	U+227D	(>+)
	∪	U+222A	(|)
	∩	U+2229	(&)
	∖	U+2216	(-)
	⊖	U+2296	(^)
	⊍	U+228D	(.)
	⊎	U+228E	(+)
	};

while ( @table #`(I could also read from a file) ) {
	# I'd really like a @table.shift(3);
	my ( $unicode, $codepoint, $texas ) = @table.splice( 0, 3 );
	# I don't particularly like the two way hash here.
	%hash{ $unicode, $texas } = %(
		unicode   => $unicode,
		texas     => $texas,
		codepoint => $codepoint
		) xx *;  # list replication
	}
}

# I don't really need a multi here, but I'm playing with it anyway.
# multi implies sub, so I could have said "multi sub MAIN"
multi MAIN( Str $s where { %hash{$_}:exists and .substr(0, 1).ord > 0xAA } ) {
	say "Running the Unicode version for $s";
	show( %hash{$s} );
	}

multi MAIN( Str $s where { %hash{$_}:exists and .substr(0, 1).ord <= 0xAA } ) {
	say "Running the Texas version for $s";
	show( %hash{$s} );
	}

sub show ( %h ) {
	say sprintf "Unicode: %s\nTexas: %s", %h<unicode texas>,
	}

Some notes on the program, which I didn’t spend much time thinking about. In the while I put an embedded comment #`(...). I think I want to create a data file to hold all of this.

I use a multi to define the same subroutine name several times but with different signatures. Inside each, I specify that they take a Str but I constrain them with where. Those code blocks get the parameter as $_ (or some other ways). That $_ is the default “topic”,. When I neglect to type out the object in .substr(0, 1), Perl 6 uses the topic.

And, I do a few hash slices here, but never change the sigil.

3 comments

  1. This utility works great in a shell that can handle Unicode!

    You can also use a `for` loop to iterate over that array, by the way:

        for @table -> $unicode, $codepoint, $texas
        {
                # I don't particularly like the two way hash here.
                %hash{ $unicode, $texas } = %(
                        unicode   => $unicode,
                        texas     => $texas,
                        codepoint => $codepoint
                        ) xx *;
                }
        }
    
  2. Is there some direct way to convert the Texas representation to the Unicode code point? I have tried quoting + .perl like this:

        say .perl
    

    but does not work, obviously.

    1. Are you trying to let P6 parse the code you wrote with Texas quotes and then deparse it with the fancier versions? I don’t know of a way to do that.

Leave a Reply to Christopher Bottoms Cancel reply

Your email address will not be published. Required fields are marked *