tf.strings.unicode_decode_with_offsets

Decodes each string into a sequence of code points with start offsets.

Compat aliases for migration

See Migration guide for more details.

tf.compat.v1.strings.unicode_decode_with_offsets

tf.strings.unicode_decode_with_offsets(
    input, input_encoding, errors='replace', replacement_char=65533,
    replace_control_characters=False, name=None
)

This op is similar to tf.strings.unicode_decode(...), but it also returns the start offset of each character within its string. These offsets can be used to align the decoded characters with the original byte sequence.

Returns a tuple (codepoints, start_offsets) where:

codepoints[i1...iN, j] is the Unicode codepoint for the jth character in input[i1...iN], when decoded using input_encoding.
start_offsets[i1...iN, j] is the start byte offset for the jth character in input[i1...iN], when decoded using input_encoding.
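To make the meaning of the two outputs concrete, here is a minimal pure-Python sketch for a single UTF-8 string. It is not TensorFlow's implementation: the helper name `decode_with_offsets` is hypothetical, it handles only valid UTF-8, and it returns plain lists rather than tensors.

```python
def decode_with_offsets(data: bytes):
    """Return (codepoints, start_offsets) for one valid UTF-8 byte string.

    Hypothetical helper for illustration only; it mirrors the output layout
    described above for a single string, not the TensorFlow op itself.
    """
    codepoints, start_offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        codepoints.append(ord(ch))         # Unicode code point of the character
        start_offsets.append(offset)       # byte index where the character starts
        offset += len(ch.encode("utf-8"))  # advance by the character's byte width


    return codepoints, start_offsets


codepoints, start_offsets = decode_with_offsets("G\xf6\xf6dnight".encode("utf-8"))
print(codepoints)
print(start_offsets)
```

Each "ö" occupies two bytes in UTF-8, which is why the offsets jump by two after each of them while the character index advances by one.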

Args
`input`	An `N` dimensional potentially ragged `string` tensor with shape `[D1...DN]`. `N` must be statically known.
`input_encoding`	String name for the Unicode encoding that should be used to decode each string.
`errors`	Specifies the response when an input string can't be converted using the indicated encoding. One of: `'strict'` (raise an exception for any illegal substrings), `'replace'` (replace illegal substrings with `replacement_char`), or `'ignore'` (skip illegal substrings).
`replacement_char`	The replacement codepoint to be used in place of invalid substrings in `input` when `errors='replace'`; and in place of C0 control characters in `input` when `replace_control_characters=True`.
`replace_control_characters`	Whether to replace the C0 control characters `(U+0000 - U+001F)` with the `replacement_char`.
`name`	A name for the operation (optional).
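As a rough analogy for the three `errors` modes, Python's built-in UTF-8 codec accepts error handlers with the same names. The sketch below uses only the standard library and is an analogy, not the TensorFlow op; it assumes the default `replacement_char` of 65533 (U+FFFD), which is also what Python's `'replace'` handler emits.

```python
raw = b"a\xffb"  # the byte 0xFF never appears in valid UTF-8

# 'replace': the illegal byte becomes U+FFFD (code point 65533).
replaced = [ord(c) for c in raw.decode("utf-8", errors="replace")]

# 'ignore': the illegal byte is skipped.
ignored = [ord(c) for c in raw.decode("utf-8", errors="ignore")]

# 'strict': decoding raises an exception instead of returning a result.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(replaced, ignored, strict_failed)
```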

Returns
A tuple of `N+1` dimensional tensors `(codepoints, start_offsets)`. `codepoints` is an `int32` tensor with shape `[D1...DN, (num_chars)]`. `start_offsets` is an `int64` tensor with shape `[D1...DN, (num_chars)]`. The returned tensors are `tf.Tensor`s if `input` is a scalar, or `tf.RaggedTensor`s otherwise.

Example:

import tensorflow as tf

# Encode two Python strings as UTF-8 bytes.
input = [s.encode('utf8') for s in (u'G\xf6\xf6dnight', u'\U0001f60a')]
result = tf.strings.unicode_decode_with_offsets(input, 'UTF-8')
result[0].to_list()  # codepoints
# [[71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]
result[1].to_list()  # start_offsets
# [[0, 1, 3, 5, 6, 7, 8, 9, 10], [0]]
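One way to use the start offsets for alignment is to slice the original bytes into one piece per character. The following is a sketch in plain Python that takes the offsets from the example output for the first string; the helper name `split_by_offsets` is hypothetical and not part of TensorFlow.

```python
def split_by_offsets(data: bytes, start_offsets):
    """Slice `data` into per-character byte pieces using start offsets.

    Hypothetical helper: each piece runs from its start offset to the next
    character's start offset, and the last piece runs to the end of `data`.
    """
    ends = list(start_offsets[1:]) + [len(data)]
    return [data[start:end] for start, end in zip(start_offsets, ends)]


data = "G\xf6\xf6dnight".encode("utf-8")
start_offsets = [0, 1, 3, 5, 6, 7, 8, 9, 10]  # taken from the example output

pieces = split_by_offsets(data, start_offsets)
chars = [piece.decode("utf-8") for piece in pieces]
print(pieces)
print(chars)
```

Because every piece is a complete UTF-8 sequence, each one decodes on its own, and joining the decoded pieces reproduces the original string.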
