Skip to content

Conversation

MikaelMayer
Copy link
Member

@MikaelMayer MikaelMayer commented Apr 18, 2025

This PR adds an opinionated way to write Dafny strings to files and read them back using UTF-8 encoding.
With this PR, it becomes finally possible to create command-line utilities that read a file and process it in an easy way.

What was changed?

  • Added FileIO.ReadUTF8FromFile and FileIO.WriteUTF8ToFile so that it's easier to read/write strings to text files within Dafny.
  • FromUTF8Checked, FromUTF16Checked and DecodeCodeUnitSequenceChecked all now return a Result instead of an Option so that the error message is clearer.

Example

For example, here is a tool that displays the content of a file

import opened Std.FileIO
method Main(args: seq<string>) {
  expect |args| >= 2;
  var result :- expect ReadUTF8FromFile(args[1]);
  print result.value;  
}

Usage:

dafny run --standard-libraries cat.dfy cat.dfy

It prints itself.

How has this been tested?

I added tests to read and to write strings similar to the existing tests to read and write UFT-8 characters.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

MikaelMayer and others added 12 commits April 16, 2025 15:39
I added my own versions of FileIO.ReadFile and FileIO.WriteFile that I needed for my own application.
Fixes #6196 

I updated the tests to ensure it was breaking and that this PR fixes the
test.
Basically, a test was testing if the type was a bitvector type, but this
type was a synonym type with the new resolver so that needed to be
updated.

<small>By submitting this pull request, I confirm that my contribution
is made under the terms of the [MIT
license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely in favor of the feature!

if !Utf8EncodingForm.IsWellFormedCodeUnitSequence(bytes) {
return Failure("Byte sequence of file '" + fileName + "' is not well formed UTF8");
}
var x: seq<bv24> := Utf8EncodingForm.DecodeCodeUnitSequence(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid all the extra error checking by using UnicodeStringsWithUnicodeChar.FromUTF8Checked instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That method only returns an Option. This error checking makes it possible to precisely report at which index there is an error. I think I prefer that. Later, we could still refactoring this error checking code if it's used somewhere else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay fair. It would be great to improve Unicode with more precise error messages as well at some point.

It does feel like you're reimplementing a fair bit of concepts from the Unicode module, and we have to be very careful to be consistent and sound.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For soundness, the good news is that all my code is only establishing the proof that I can call into the standard libraries, so we should be good for that.

var x: seq<bv24> := Utf8EncodingForm.DecodeCodeUnitSequence(bytes);
for i := 0 to |x|
invariant forall k | 0 <= k < i :: x[k] < 1 << 16 {
if !(x[i] < 1 << 16) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, in --unicode-char:true mode (which is the only mode the libraries currently support) valid characters go up to 0x10FFFF, so you may reject valid data here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I'd love to do that, when I run make update-standard-libraries with the change you suggested,

      invariant forall k | 0 <= k < i :: x[k] < 0x10FFFF {
      if !(x[i] < 0x10FFFF) {
        return Failure("index " + Strings.OfInt(i) + " is not a valid char");
      }

I get.

FileIO5-notarget-java-cs-js-go-py.dfy(93,54): Error: bit-vector value to be converted might not fit in char
   |
93 |     var s := seq(|x|, i requires 0 <= i < |x| => x[i] as char);
   |                                                       ^^

which is consistent with the verification condition we emit:

        var toWidth = 16;
        if (toWidth < fromWidth) {
          // Check "expr < (1 << toWidth)" in type "fromType" (note that "1 << toWidth" is indeed a value in "fromType")
          PutSourceIntoLocal();
          var toBound = BaseTypes.BigNum.FromBigInt(BigInteger.One << toWidth); // 1 << toWidth
          var bound = BplBvLiteralExpr(tok, toBound, fromType.AsBitVectorType);
          var boundsCheck = FunctionCall(expr.Origin, "lt_bv" + fromWidth, Bpl.Type.Bool, o, bound);
          var dafnyBound = new BinaryExpr(expr.Origin, BinaryExpr.Opcode.LeftShift, Expression.CreateIntLiteral(expr.Origin, 1), Expression.CreateIntLiteral(expr.Origin, toWidth));
          var dafnyBoundsCheck = new BinaryExpr(expr.Origin, BinaryExpr.Opcode.Lt, expr, dafnyBound);
          builder.Add(Assert(tok, boundsCheck, new ConversionFit("bit-vector value", toType, dafnyBoundsCheck, errorMsgPrefix), builder.Context));
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's because the correct check is to make sure the character isn't a surrogate as well. See here:

public static Expression MakeCharBoundsCheckUnicode(Expression expr) {

I think you could simplify and just test if x[i] is char?

if !Utf8EncodingForm.IsWellFormedCodeUnitSequence(bytes) {
return Failure("Byte sequence of file '" + fileName + "' is not well formed UTF8");
}
var x: seq<bv24> := Utf8EncodingForm.DecodeCodeUnitSequence(bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay fair. It would be great to improve Unicode with more precise error messages as well at some point.

It does feel like you're reimplementing a fair bit of concepts from the Unicode module, and we have to be very careful to be consistent and sound.

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for pushing the errors improvements into Unicode, love it! Just some cleanup before merging

*/
method ReadUTF8FromFile(fileName: string) returns (r: Result<string, string>) {
var bytes :- ReadBytesFromFile(fileName);
return UnicodeStringsWithUnicodeChar.FromUTF8Checked(seq(|bytes|, i requires 0 <= i < |bytes| => bytes[i] as uint8));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SO much better, thank you! The rest of the declarations below aside from WriteUTF8ToFile are dead code and can be deleted now, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd mention the improvements to error messages in Unicode as well, partially because it's a great improvement but also because it's technically breaking. Perhaps mention how to unbreak affected code as well with Result.ToOption()

expect res.value == expectedStr;
}

// Failure path: attempting to read from a blank file path should never work.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second failure case where the bytes are not UTF8 would be very valuable too, especially since AFAICT we don't prove that the index is correct. Could also put it in UnicodeExamples.dfy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeExamples already has a test for ill cases, so I just added 2 expect lines to demonstrate what to expect as an error message.

@@ -127,43 +128,45 @@ abstract module Std.Unicode.UnicodeEncodingForm {

/**
* Returns the unique partition of the given code unit sequence into minimal well-formed code unit subsequences,
* or None if no such partition exists.
* or Failure(CodeUnitSeq) if no such partition exists.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific and say it returns the suffix that couldn't be partitioned? I had to read a fair bit of other code to figure that out.

Comment on lines 1 to 2
With `--standard-libraries` you can now read an UTF-8 text files from the disk using `Std.FileIO.ReadFile(path: string): Result<string, string>`.
To write some content to the disk, use `Std.FileIO.WriteFile(path: string, content: string): Outcome<string>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
With `--standard-libraries` you can now read an UTF-8 text files from the disk using `Std.FileIO.ReadFile(path: string): Result<string, string>`.
To write some content to the disk, use `Std.FileIO.WriteFile(path: string, content: string): Outcome<string>`.
With `--standard-libraries` you can now read an UTF-8 text files from the disk using `Std.FileIO.ReadUTF8FromFile(path: string): Result<string, string>`.
To write some content to the disk, use `Std.FileIO.WriteUTF8ToFile(path: string, content: string): Outcome<string>`.

robin-aws
robin-aws previously approved these changes May 1, 2025
@MikaelMayer MikaelMayer enabled auto-merge (squash) May 2, 2025 12:53
@MikaelMayer MikaelMayer merged commit 5c84042 into master May 2, 2025
22 checks passed
@MikaelMayer MikaelMayer deleted the feat-fileio-read-write branch May 2, 2025 15:09
olivier-aws pushed a commit that referenced this pull request May 5, 2025
This PR adds an opinionated way to write Dafny strings to files and read
them back using UTF-8 encoding.
With this PR, it becomes finally possible to create command-line
utilities that read a file and process it in an easy way.

### What was changed?

- Added `FileIO.ReadUTF8FromFile` and `FileIO.WriteUTF8ToFile` so that
it's easier to read/write strings to text files within Dafny.
- `FromUTF8Checked`, `FromUTF16Checked` and
`DecodeCodeUnitSequenceChecked` all now return a `Result` instead of an
`Option` so that the error message is clearer.

### Example

For example, here is a tool that displays the content of a file
```
import opened Std.FileIO
method Main(args: seq<string>) {
  expect |args| >= 2;
  var result :- expect ReadUTF8FromFile(args[1]);
  print result.value;  
}
```

Usage:
```
dafny run --standard-libraries cat.dfy cat.dfy
```
It prints itself.


### How has this been tested?
I added tests to read and to write strings similar to the existing tests
to read and write UFT-8 characters.

<small>By submitting this pull request, I confirm that my contribution
is made under the terms of the [MIT
license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants